Lip Reading for Low-resource Languages by Learning and Combining General
Speech Knowledge and Language-specific Knowledge
- URL: http://arxiv.org/abs/2308.09311v2
- Date: Fri, 12 Jan 2024 07:36:45 GMT
- Title: Lip Reading for Low-resource Languages by Learning and Combining General
Speech Knowledge and Language-specific Knowledge
- Authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, and Yong Man Ro
- Abstract summary: This paper proposes a novel lip reading framework, especially for low-resource languages.
Because low-resource languages lack enough video-text paired data to train such models, developing lip reading systems for them is regarded as challenging.
- Score: 57.38948190611797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel lip reading framework, especially for
low-resource languages, a setting that has not been well addressed in the
previous literature. Because low-resource languages do not have enough
video-text paired data to train a model powerful enough to capture both lip
movements and language, developing lip reading models for them is regarded as
challenging. To mitigate this challenge, we learn general speech knowledge,
the ability to model lip movements, from a high-resource language through the
prediction of speech units. Since different languages partially share common
phonemes, general speech knowledge learned from one language can be extended
to others. We then learn language-specific knowledge, the ability to model
language, by proposing a Language-specific Memory-augmented Decoder
(LMDecoder). LMDecoder saves language-specific audio features into memory
banks and can be trained on audio-text paired data, which is more easily
accessible than video-text paired data. With LMDecoder, we can therefore
transform the input speech units into language-specific audio features and
translate them into text by utilizing the learned rich language knowledge.
Finally, by combining general speech knowledge and language-specific
knowledge, we can efficiently develop lip reading models even for
low-resource languages. The effectiveness of the proposed method is evaluated
through extensive experiments on five languages: English, Spanish, French,
Italian, and Portuguese.
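To make the two-stage design concrete, here is a minimal PyTorch sketch of a memory-augmented decoder in the spirit of LMDecoder. Every name and size below (the 200-cluster unit vocabulary, the 512-slot memory bank, the soft-attention retrieval) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of a memory-augmented decoder in the spirit of
# LMDecoder. All sizes and names are illustrative assumptions, not the
# authors' code.
import torch
import torch.nn as nn

class MemoryAugmentedDecoder(nn.Module):
    """Maps discrete speech units to language-specific audio features via a
    learnable memory bank, then decodes text with a Transformer decoder."""

    def __init__(self, num_units=200, d_model=256, memory_slots=512,
                 vocab_size=1000):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, d_model)
        # Memory bank of language-specific audio features (learned from
        # audio-text pairs, which are easier to collect than video-text).
        self.memory = nn.Parameter(torch.randn(memory_slots, d_model))
        self.text_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def retrieve(self, units):
        # Soft attention over memory slots: each unit embedding is replaced
        # by a mixture of stored audio features ("transform the input speech
        # units into language-specific audio features").
        q = self.unit_emb(units)                               # (B, T, D)
        attn = torch.softmax(q @ self.memory.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ self.memory                              # (B, T, D)

    def forward(self, units, text_tokens):
        audio_feats = self.retrieve(units)
        tgt = self.text_emb(text_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            text_tokens.size(1))
        h = self.decoder(tgt, audio_feats, tgt_mask=mask)
        return self.out(h)                                     # (B, T, V)

# Pre-train on units derived from audio plus text transcripts; at lip reading
# time, a visual front-end predicting the same units replaces the audio path.
decoder = MemoryAugmentedDecoder()
units = torch.randint(0, 200, (2, 50))   # e.g. clustered speech units
text = torch.randint(0, 1000, (2, 12))
print(decoder(units, text).shape)        # torch.Size([2, 12, 1000])
```

The property the sketch tries to capture is that all language knowledge lives in components trainable from audio-text pairs alone; only the unit-producing front-end ever needs video.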
Related papers
- Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models [13.855545744177586]
This paper examines the performance of existing audio language models on an underserved language, Thai.
Despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities.
This paper integrates audio comprehension and speech instruction-following capabilities into a single unified model.
arXiv Detail & Related papers (2024-09-17T09:04:03Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages [12.00637655338665]
We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model.
For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.
arXiv Detail & Related papers (2022-05-25T10:53:24Z)
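As a side note on the mining step described above, here is a hedged NumPy sketch of ratio-margin scoring, the standard criterion used with sentence-encoder bitext mining; the function name and the toy data are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of margin-based bitext mining over sentence embeddings;
# illustrative, not the paper's exact pipeline.
import numpy as np

def margin_scores(src, tgt, k=4):
    """Ratio-margin score for every src/tgt pair: cosine similarity
    normalized by the average similarity to each side's k nearest
    neighbors. src: (m, d), tgt: (n, d), rows unit-normalized."""
    sim = src @ tgt.T                                    # cosine similarities
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # top-k mean per src
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # top-k mean per tgt
    return sim / (0.5 * (knn_src[:, None] + knn_tgt[None, :]))

# Toy demo with random unit vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 16)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(6, 16)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
scores = margin_scores(src, tgt)
print(scores.argmax(axis=1))  # candidate translation per source sentence;
                              # in practice, keep only pairs above a threshold
```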
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
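For readers unfamiliar with the term, a cloze-style probe simply asks a pretrained LM to fill a blank in a templated fact. A minimal sketch using Hugging Face transformers with multilingual BERT as an assumed stand-in (X-FACTR's actual templates, languages, and models differ):

```python
# Hypothetical cloze-style factual probe; mBERT is a stand-in model and the
# templates are illustrative, not X-FACTR's benchmark prompts.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
templates = [
    "The capital of France is [MASK].",      # English probe
    "La capitale de la France est [MASK].",  # French probe of the same fact
]
for t in templates:
    top = fill(t)[0]  # highest-scoring completion
    print(t, "->", top["token_str"], round(top["score"], 3))
```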
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
- Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, for transfer learning in a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages and transfers that knowledge to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)
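To sketch what "conditioning the optimization on the code-switching data" can look like, below is a schematic first-order meta-update in PyTorch: adapt fast weights on monolingual batches, then steer the real update by the loss those adapted weights incur on code-switched data. The helper names and toy task are assumptions, not the authors' implementation.

```python
# Schematic first-order meta-transfer step; illustrative only.
import copy
import torch
import torch.nn as nn

def meta_transfer_step(model, loss_fn, mono_batches, cs_batch,
                       inner_lr=1e-2, meta_lr=1e-3):
    """One meta-update: an inner step per monolingual batch, an outer loss
    on the code-switched batch, averaged into a single update of `model`."""
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    cs_x, cs_y = cs_batch
    for x, y in mono_batches:                 # one batch per language
        fast = copy.deepcopy(model)           # temporary fast weights
        loss_fn(fast(x), y).backward()        # adapt to a single language
        with torch.no_grad():
            for p in fast.parameters():
                p -= inner_lr * p.grad
                p.grad = None
        loss_fn(fast(cs_x), cs_y).backward()  # evaluate on code-switching
        for g, p in zip(meta_grads, fast.parameters()):
            g += p.grad
    with torch.no_grad():                     # apply averaged meta-gradient
        for p, g in zip(model.parameters(), meta_grads):
            p -= meta_lr * g / len(mono_batches)

# Toy usage: a linear "recognizer" over random features.
model = nn.Linear(8, 4)
loss = nn.CrossEntropyLoss()
mono = [(torch.randn(16, 8), torch.randint(0, 4, (16,))) for _ in range(2)]
cs_batch = (torch.randn(16, 8), torch.randint(0, 4, (16,)))
meta_transfer_step(model, loss, mono, cs_batch)
```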
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.