Visual Speech Recognition for Multiple Languages in the Wild
- URL: http://arxiv.org/abs/2202.13084v1
- Date: Sat, 26 Feb 2022 07:21:00 GMT
- Title: Visual Speech Recognition for Multiple Languages in the Wild
- Authors: Pingchuan Ma, Stavros Petridis, Maja Pantic
- Abstract summary: We show that designing better VSR models is as important as using larger training sets.
We propose the addition of prediction-based auxiliary tasks to a VSR model.
We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
- Score: 64.52593130370757
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Visual speech recognition (VSR) aims to recognise the content of speech based
on the lip movements without relying on the audio stream. Advances in deep
learning and the availability of large audio-visual datasets have led to the
development of much more accurate and robust VSR models than ever before.
However, these advances are usually due to larger training sets rather than the
model design. In this work, we demonstrate that designing better models is
as important as using larger training sets. We propose the addition of
prediction-based auxiliary tasks to a VSR model and highlight the importance of
hyper-parameter optimisation and appropriate data augmentations. We show that
such a model works for different languages and outperforms all previous methods
trained on publicly available datasets by a large margin. It even outperforms
models that were trained on non-publicly available datasets containing up to
21 times more data. We show furthermore that using additional training data,
even in other languages or with automatically generated transcriptions, results
in further improvement.
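To make the idea of prediction-based auxiliary tasks concrete, below is a minimal sketch of a VSR model whose shared visual encoder feeds both a main CTC transcription head and an auxiliary head that regresses per-frame audio features. The module sizes, the choice of auxiliary target, and the loss weighting are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VSRWithAuxiliaryTask(nn.Module):
    """Shared visual encoder + main CTC head + auxiliary prediction head (sketch)."""
    def __init__(self, feat_dim=512, hidden_dim=256, vocab_size=40, audio_dim=80):
        super().__init__()
        # Shared encoder over pre-extracted per-frame lip features.
        self.encoder = nn.GRU(feat_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        # Main task: per-frame character logits trained with CTC.
        self.ctc_head = nn.Linear(2 * hidden_dim, vocab_size)
        # Auxiliary prediction task (assumed here): regress per-frame audio features.
        self.aux_head = nn.Linear(2 * hidden_dim, audio_dim)

    def forward(self, visual_feats):               # visual_feats: (B, T, feat_dim)
        enc, _ = self.encoder(visual_feats)        # (B, T, 2 * hidden_dim)
        return self.ctc_head(enc), self.aux_head(enc)

def training_loss(model, visual_feats, targets, target_lens, audio_feats, aux_weight=0.1):
    """Total loss = CTC transcription loss + weighted auxiliary prediction loss."""
    logits, aux_pred = model(visual_feats)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)      # (T, B, V) for CTC
    input_lens = torch.full((visual_feats.size(0),), visual_feats.size(1), dtype=torch.long)
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=0)
    aux = F.mse_loss(aux_pred, audio_feats)
    return ctc + aux_weight * aux

The auxiliary weight, like the other hyper-parameters the abstract emphasises, would have to be tuned on a validation set; the data augmentations are applied to the input video independently of this loss.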
Related papers
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy: pre-training with discretized visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to the previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
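The summary above does not say how the visual speech units are obtained; a common way to discretize continuous frame-level features is k-means clustering (as in HuBERT-style audio units), so a minimal sketch under that assumption could look like this. The feature dimension and number of units are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans

def learn_unit_vocabulary(frame_features, num_units=200):
    """Fit k-means on frame-level visual features pooled over the training set."""
    return KMeans(n_clusters=num_units, n_init=10, random_state=0).fit(frame_features)

def to_units(kmeans, utterance_features):
    """Map one utterance's (T, feat_dim) features to a sequence of T discrete unit ids."""
    return kmeans.predict(utterance_features)

# Toy usage with random stand-in features (real features would come from a visual encoder):
feats = np.random.randn(10000, 256).astype(np.float32)
km = learn_unit_vocabulary(feats, num_units=200)
units = to_units(km, feats[:75])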
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding an audio processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z)
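The summary above does not give the training objective; one plausible reading is that the student language model is trained to match pooled speech embeddings of the matching audio (e.g. from a frozen Whisper encoder) alongside its usual task loss, so that no audio is needed at test time. The cosine-alignment term and its weighting below are assumptions, not the paper's recipe.

import torch.nn.functional as F

def student_loss(task_loss, student_embedding, teacher_speech_embedding, alpha=0.5):
    """Task loss on the transcript plus an alignment term towards the teacher's speech embedding.

    student_embedding: (B, D) pooled transcript representation from the student LM.
    teacher_speech_embedding: (B, D) pooled embedding of the matching audio, projected to D.
    """
    align = 1.0 - F.cosine_similarity(student_embedding, teacher_speech_embedding, dim=-1).mean()
    return task_loss + alpha * align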
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework for data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models [48.44820587495038]
Self-supervised representation learning (SSRL) has demonstrated performance superior to supervised models for tasks including phoneme recognition.
Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available.
We propose to use audio augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech, to pre-train SSRL models in a low-resource setting, and evaluate the resulting models on phoneme recognition.
arXiv Detail & Related papers (2023-09-22T10:09:09Z)
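Of the four augmentations listed above, pitch variation and noise addition are signal-level transforms (the other two are choices of additional pre-training data) and could be sketched roughly as follows; the parameter ranges are illustrative assumptions, not the paper's settings.

import numpy as np
import librosa

def pitch_variation(waveform, sample_rate=16000, max_semitones=2.0):
    """Shift the pitch by a random number of semitones in [-max, +max]."""
    steps = np.random.uniform(-max_semitones, max_semitones)
    return librosa.effects.pitch_shift(waveform, sr=sample_rate, n_steps=steps)

def noise_addition(waveform, snr_db=20.0):
    """Add white noise at a given signal-to-noise ratio (in dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.randn(len(waveform)) * np.sqrt(noise_power)
    return waveform + noise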
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently-proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly better performance than the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)