SynthVSR: Scaling Up Visual Speech Recognition With Synthetic
Supervision
- URL: http://arxiv.org/abs/2303.17200v2
- Date: Mon, 3 Apr 2023 06:30:19 GMT
- Title: SynthVSR: Scaling Up Visual Speech Recognition With Synthetic
Supervision
- Authors: Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma,
Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář,
Stavros Petridis, Maja Pantic, Christian Fuegen
- Abstract summary: We study the potential of leveraging synthetic visual data for visual speech recognition (VSR).
The key idea is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech.
We evaluate the performance of our approach on the largest public VSR benchmark - Lip Reading Sentences 3 (LRS3).
- Score: 60.54020550732634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently reported state-of-the-art results in visual speech recognition (VSR)
often rely on increasingly large amounts of video data, while the publicly
available transcribed video datasets are limited in size. In this paper, for
the first time, we study the potential of leveraging synthetic visual data for
VSR. Our method, termed SynthVSR, substantially improves the performance of VSR
systems with synthetic lip movements. The key idea behind SynthVSR is to
leverage a speech-driven lip animation model that generates lip movements
conditioned on the input speech. The speech-driven lip animation model is
trained on an unlabeled audio-visual dataset and could be further optimized
towards a pre-trained VSR model when labeled videos are available. As plenty of
transcribed acoustic data and face images are available, we are able to
generate large-scale synthetic data using the proposed lip animation model for
semi-supervised VSR training. We evaluate the performance of our approach on
the largest public VSR benchmark - Lip Reading Sentences 3 (LRS3). SynthVSR
achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming
off-the-shelf approaches using thousands of hours of video. The WER is further
reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is
on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore,
when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields
a new state-of-the-art VSR WER of 16.9% using publicly available data only,
surpassing the recent state-of-the-art approaches trained with 29 times more
non-public machine-transcribed video data (90,000 hours). Finally, we perform
extensive ablation studies to understand the effect of each component in our
proposed method.
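
To make the described pipeline concrete, below is a minimal Python sketch of the data flow outlined in the abstract: a speech-driven lip animation model renders synthetic lip-movement clips from transcribed audio and still face images, and the resulting clips are mixed with real labeled videos for semi-supervised VSR training. All class and function names are hypothetical placeholders for illustration, not the authors' released code.

```python
# Hypothetical sketch of a SynthVSR-style data pipeline (not the authors' code).
from dataclasses import dataclass
from typing import List, Tuple
import random


@dataclass
class LabeledClip:
    video: object      # lip-region video frames (real or synthetic)
    transcript: str    # ground-truth or inherited transcript


def animate_lips(face_image: object, speech_audio: object) -> object:
    """Placeholder for the speech-driven lip animation model: given a still
    face image and an utterance, return a synthetic talking-face video."""
    return ("synthetic_video", face_image, speech_audio)


def generate_synthetic_corpus(transcribed_audio: List[Tuple[object, str]],
                              face_images: List[object]) -> List[LabeledClip]:
    """Pair each transcribed utterance with a face image and render a synthetic
    lip-movement clip; the clip inherits the audio's transcript as its label."""
    corpus = []
    for audio, transcript in transcribed_audio:
        face = random.choice(face_images)
        corpus.append(LabeledClip(video=animate_lips(face, audio),
                                  transcript=transcript))
    return corpus


def train_vsr(real_clips: List[LabeledClip],
              synthetic_clips: List[LabeledClip]) -> None:
    """Semi-supervised VSR training: real and synthetic clips are mixed into a
    single training set; the VSR model itself is out of scope for this sketch."""
    training_set = real_clips + synthetic_clips
    random.shuffle(training_set)
    for clip in training_set:
        pass  # one optimisation step on (clip.video, clip.transcript)


if __name__ == "__main__":
    audio_corpus = [("utt_0.wav", "hello world"), ("utt_1.wav", "lip reading")]
    faces = ["face_a.jpg", "face_b.jpg"]
    synthetic = generate_synthetic_corpus(audio_corpus, faces)
    real = [LabeledClip(video="lrs3_clip.mp4", transcript="real example")]
    train_vsr(real, synthetic)
```

The point of the sketch is that the synthetic clip inherits its label from the audio transcript, so any transcribed acoustic corpus plus a pool of face images can be turned into additional VSR training data.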
Related papers
- SynesLM: A Unified Approach for Audio-visual Speech Recognition and Translation via Language Model and Synthetic Data [42.48380346580101]
We present SynesLM, a unified model which can perform three multimodal language understanding tasks.
For zero-shot AV-ASR, SynesLM achieved SOTA performance by lowering the Word Error Rate (WER) from 43.4% to 39.4%.
Our results in VST and VMT outperform the previous results, improving the BLEU score to 43.5 from 37.2 for VST, and to 54.8 from 54.4 for VMT.
arXiv Detail & Related papers (2024-08-01T15:09:32Z)
- BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition [72.51848069125822]
We propose BRAVEn, an extension to the RAVEn method, which learns speech representations entirely from raw audio-visual data.
Our modifications to RAVEn enable BRAVEn to achieve state-of-the-art results among self-supervised methods.
Our results suggest that readily available unlabelled audio-visual data can largely replace costly transcribed data.
arXiv Detail & Related papers (2024-04-02T16:48:20Z)
- LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data [9.049193356646635]
Our method distills knowledge from a trained Conformer-based ASR model, achieving competitive performance on standard VSR benchmarks.
Our model can be trained on a single consumer-grade GPU within a few days and is capable of performing real-time end-to-end VSR on dated hardware.
arXiv Detail & Related papers (2023-12-15T12:04:24Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size (a minimal sketch of this pseudo-labelling idea follows the related-papers list below).
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities improves speech recognition in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities' pretext tasks, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
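
The Auto-AVSR entry above summarises a pseudo-labelling pipeline; the following minimal sketch illustrates that idea under simple assumptions: a pre-trained ASR model transcribes unlabelled audio-visual clips, and the machine transcripts are appended to the labelled training set. All names are illustrative placeholders, not the paper's code.

```python
# Hypothetical sketch of ASR-based pseudo-labelling for AV-ASR/VSR training data.
from typing import List, Tuple


def asr_transcribe(audio: str) -> str:
    """Placeholder for a pre-trained ASR model producing a (noisy) transcript."""
    return f"auto transcript of {audio}"


def pseudo_label(unlabelled_clips: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn unlabelled (video, audio) clips into (video, transcript) pairs."""
    return [(video, asr_transcribe(audio)) for video, audio in unlabelled_clips]


def build_training_set(labelled: List[Tuple[str, str]],
                       unlabelled: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Grow the training set with machine-transcribed clips."""
    return labelled + pseudo_label(unlabelled)


if __name__ == "__main__":
    labelled = [("lrs3_0001.mp4", "real transcript")]
    unlabelled = [("clip_0001.mp4", "clip_0001.wav"),
                  ("clip_0002.mp4", "clip_0002.wav")]
    print(build_training_set(labelled, unlabelled))
```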