Synchronous Bidirectional Learning for Multilingual Lip Reading
- URL: http://arxiv.org/abs/2005.03846v4
- Date: Fri, 14 Aug 2020 15:34:49 GMT
- Title: Synchronous Bidirectional Learning for Multilingual Lip Reading
- Authors: Mingshuang Luo, Shuang Yang, Xilin Chen, Zitao Liu, Shiguang Shan
- Abstract summary: Lip movements of all languages share similar patterns due to the common structures of human organs.
Phonemes are more closely related to lip movements than alphabet letters.
A novel SBL block is proposed to learn the rules for each language in a fill-in-the-blank way.
- Score: 99.14744013265594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip reading has received increasing attention in recent years. This paper
focuses on the synergy of multilingual lip reading. There are about 7,000
languages in the world, which implies that it is impractical to train
separate lip reading models with large-scale data for each language. Although
each language has its own linguistic and pronunciation rules, the lip movements
of all languages share similar patterns due to the common structures of human
organs. Based on this idea, we try to explore the synergized learning of
multilingual lip reading in this paper, and further propose a synchronous
bidirectional learning (SBL) framework for effective synergy of multilingual
lip reading. We first introduce phonemes as the modeling units for the
multilingual setting. Phonemes are more closely related to lip movements
than alphabet letters are, and similar phonemes lead to similar visual
patterns regardless of the target language.
Then, a novel SBL block is proposed to learn the rules for each language in a
fill-in-the-blank way. Specifically, the model has to learn to infer the target
unit given its bidirectional context, which could represent the composition
rules of phonemes for each language. To make the learning process more targeted
at each particular language, an extra task of predicting the language identity
is introduced in the learning process. Finally, a thorough comparison on LRW
(English) and LRW-1000 (Mandarin) is performed, which shows the promising
benefits from the synergized learning of different languages and also reports a
new state-of-the-art result on both datasets.
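The fill-in-the-blank objective described above can be sketched as a simple data-preparation step: each phoneme in a sequence becomes a prediction target given its bidirectional context, and every example carries a language-identity label for the auxiliary classification task. This is a minimal illustrative sketch; the function name, mask token, and phoneme transcription are assumptions, not taken from the paper.

```python
# Sketch of the fill-in-the-blank training signal with a language-ID label.
# All names here (make_fill_in_blank_examples, MASK) are illustrative.

MASK = "<mask>"

def make_fill_in_blank_examples(phonemes, lang_id):
    """For each position, mask the target phoneme and keep its
    bidirectional context; attach the language identity so an
    auxiliary head can be trained to predict it."""
    examples = []
    for i, target in enumerate(phonemes):
        context = phonemes[:i] + [MASK] + phonemes[i + 1:]
        examples.append({"context": context, "target": target, "lang": lang_id})
    return examples

# The English word "lip" as an illustrative phoneme sequence.
for ex in make_fill_in_blank_examples(["l", "ih", "p"], "en"):
    print(ex["context"], "->", ex["target"], "| lang:", ex["lang"])
```

A model trained on such examples must learn, per language, which phoneme fits a given bidirectional context, which is the sense in which the SBL block captures each language's phoneme composition rules.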
Related papers
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
Since low-resource languages do not have enough video-text paired data to train a model, developing lip reading models for them is regarded as challenging.
This paper proposes a novel lip reading framework designed especially for low-resource languages.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are an essential component of a Text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation related task, typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models with more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes [0.2538209532048867]
A novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this question for specific linguistic features.
We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages.
We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.
arXiv Detail & Related papers (2021-09-10T15:03:11Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.