Visual Speech Recognition for Languages with Limited Labeled Data using
Automatic Labels from Whisper
- URL: http://arxiv.org/abs/2309.08535v2
- Date: Fri, 12 Jan 2024 07:20:29 GMT
- Title: Visual Speech Recognition for Languages with Limited Labeled Data using
Automatic Labels from Whisper
- Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro
- Abstract summary: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages.
We employ a Whisper model which can conduct both language identification and audio-based speech recognition.
By comparing VSR models trained on automatic labels with those trained on human-annotated labels, we show that the automatic labels yield similar VSR performance.
- Score: 96.43501666278316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for
multiple languages, especially for low-resource languages with only a limited
amount of labeled data. Unlike previous methods that tried to improve VSR
performance for a target language by transferring knowledge learned from other
languages, we explore whether the amount of training data itself can be
increased for these languages without human intervention. To this end, we
employ a Whisper model, which can conduct both language identification and
audio-based speech recognition. It serves to filter data of the desired
languages and to transcribe labels from an unannotated, multilingual
audio-visual data pool. By comparing VSR models trained on the automatic labels
with those trained on human-annotated labels, we show that similar VSR
performance can be achieved without utilizing any human annotations. Through
this automated labeling process, we label the large-scale unlabeled multilingual
databases VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four
languages with scarce VSR resources: French, Italian, Spanish, and Portuguese.
With the automatic labels, we achieve new state-of-the-art performance on mTEDx
in all four languages, significantly surpassing the previous methods. The
automatic labels are available online:
https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages
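To make the labeling pipeline concrete, the sketch below illustrates the idea with the openai-whisper package: detect each clip's spoken language, keep only clips in the four target languages, and take Whisper's transcription as the automatic label. This is a minimal illustration, not the authors' released code; the checkpoint name ("large-v2"), the probability threshold, and the helper function are assumptions made for the example.

```python
# Minimal sketch of Whisper-based automatic labeling for VSR training data.
# Assumptions (not from the paper): model size, probability threshold, and
# how clips are iterated; the authors' actual pipeline may differ.
import whisper

TARGET_LANGS = {"fr", "it", "es", "pt"}  # French, Italian, Spanish, Portuguese
LANG_PROB_THRESHOLD = 0.5                # assumed filtering threshold

model = whisper.load_model("large-v2")   # multilingual Whisper checkpoint


def label_clip(audio_path: str):
    """Return (language, transcript) if the clip is in a target language, else None."""
    # Load up to 30 s of audio and compute the log-Mel spectrogram Whisper expects.
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

    # Step 1: language identification, used to filter the multilingual pool.
    _, probs = model.detect_language(mel)
    lang = max(probs, key=probs.get)
    if lang not in TARGET_LANGS or probs[lang] < LANG_PROB_THRESHOLD:
        return None

    # Step 2: audio-based speech recognition to produce the automatic label.
    result = model.transcribe(audio_path, language=lang)
    return lang, result["text"].strip()


if __name__ == "__main__":
    print(label_clip("example_clip.wav"))  # hypothetical file name
```

In the paper, labels produced in this way are paired with the corresponding video frames to train the VSR models.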
Related papers
- Lip Reading for Low-resource Languages by Learning and Combining General
Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework, especially for low-resource languages.
Since low-resource languages do not have enough video-text paired data to train a model, developing lip reading models for them is regarded as challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients include a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones, but, with the amount of data held fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both monolingual and multilingual ASR by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z)
- Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition [0.0]
We present an approach to create labelled data for Maithili, Bhojpuri and Dogri.
All data and models are available in open domain.
arXiv Detail & Related papers (2022-03-31T06:12:52Z)
- Code Switched and Code Mixed Speech Recognition for Indic languages [0.0]
Training multilingual automatic speech recognition (ASR) systems is challenging because acoustic and lexical information is typically language-specific.
We compare the performance of an end-to-end multilingual speech recognition system with that of monolingual models conditioned on language identification (LID).
We also propose a similar technique for the code-switched setting, achieving WERs of 21.77 and 28.27 on Hindi-English and Bengali-English, respectively.
arXiv Detail & Related papers (2022-03-30T18:09:28Z)
- Pseudo-Labeling for Massively Multilingual Speech Recognition [34.295967235026936]
Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems.
We propose a simple pseudo-labeling recipe that works well even with low-resource languages.
arXiv Detail & Related papers (2021-10-30T03:30:17Z)
- Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0 [7.378368959253632]
We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages.
A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves similar performance as the topline multilingual XLSR model.
arXiv Detail & Related papers (2021-10-07T15:29:22Z)
- Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend the idea of jointly trained acoustic and written word embeddings to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the cross-lingual setting.
Our analysis shows that even phones unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)