AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2309.17395v1
- Date: Fri, 29 Sep 2023 16:57:21 GMT
- Title: AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
- Authors: Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko
- Abstract summary: We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL).
AV-CPL is a semi-supervised method to train an audio-visual speech recognition model on a combination of labeled and unlabeled videos.
Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels.
- Score: 27.58390468474957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual speech contains synchronized audio and visual information that
provides cross-modal supervision to learn representations for both automatic
speech recognition (ASR) and visual speech recognition (VSR). We introduce
continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a
semi-supervised method to train an audio-visual speech recognition (AVSR) model
on a combination of labeled and unlabeled videos with continuously regenerated
pseudo-labels. Our models are trained for speech recognition from audio-visual
inputs and can perform speech recognition using both audio and visual
modalities, or only one modality. Our method uses the same audio-visual model
for both supervised training and pseudo-label generation, mitigating the need
for external speech recognition models to generate pseudo-labels. AV-CPL
obtains significant improvements in VSR performance on the LRS3 dataset while
maintaining practical ASR and AVSR performance. Finally, using visual-only
speech data, our method is able to leverage unlabeled visual speech to improve
VSR.
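The abstract describes the training scheme only at a high level. As a rough illustration, the sketch below shows one plausible continuous pseudo-labeling step in PyTorch-style code; the model, data loaders, criterion, and decode function are hypothetical placeholders, not the authors' implementation. The key point it encodes is that the same audio-visual model both computes the supervised loss on labeled clips and, with gradients disabled, regenerates pseudo-labels for unlabeled clips that it then trains on.

```python
import torch

def cpl_step(model, optimizer, labeled_batch, unlabeled_batch,
             criterion, decode, lambda_u=1.0):
    # One continuous pseudo-labeling (CPL) step; hypothetical sketch only.
    audio_l, video_l, text_l = labeled_batch
    audio_u, video_u = unlabeled_batch
    model.train()

    # Supervised loss on labeled audio-visual clips.
    loss = criterion(model(audio_l, video_l), text_l)

    # Regenerate pseudo-labels with the *same* model (no external ASR),
    # without tracking gradients; `decode` stands in for greedy or
    # beam-search decoding into token sequences.
    with torch.no_grad():
        pseudo_text = decode(model(audio_u, video_u))

    # Unsupervised loss against the freshly regenerated pseudo-labels.
    loss = loss + lambda_u * criterion(model(audio_u, video_u), pseudo_text)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice such loops typically start from a supervised-only warm-up and may use modality dropout so the model remains usable with audio-only or video-only inputs; those details are assumptions here, not claims about the paper.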
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy based on processing visual speech units, i.e., discretized visual speech representations.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully supervised learning methods on the LRS3 dataset, achieving a WER of 26%.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that the added components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec, which builds audio-visual representations by predicting contextualized target representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition [29.05833230733178]
We propose the Visual Context-driven Audio Feature Enhancement module (V-CAFE) to enhance noisy input audio speech with the help of audio-visual correspondence.
The proposed V-CAFE is designed to capture the transition of lip movements, namely the visual context, and to generate a noise-reduction mask based on the obtained visual context.
The effectiveness of the proposed method is evaluated in noisy speech recognition and overlapped speech recognition experiments using the two largest audio-visual datasets, LRS2 and LRS3.
arXiv Detail & Related papers (2022-07-13T08:07:19Z)
- Lip-Listening: Mixing Senses to Understand Lips using Cross Modality Knowledge Distillation for Word-Based Models [0.03499870393443267]
This work builds on recent state-of-the-art word-based lipreading models by integrating sequence-level and frame-level Knowledge Distillation (KD) into their systems.
We propose a technique to transfer speech recognition capabilities from audio speech recognition systems to visual speech recognizers, with the goal of utilizing audio data during lipreading model training; a minimal sketch of such a distillation term is given after this list.
arXiv Detail & Related papers (2022-06-05T15:47:54Z)
- End-to-end multi-talker audio-visual ASR using an active speaker attention module [5.9698688193789335]
The paper presents a new approach for end-to-end audio-visual multi-talker speech recognition.
The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decoded text to one of multiple visible faces.
arXiv Detail & Related papers (2022-04-01T18:42:14Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
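Among the entries above, the Lip-Listening paper names the most concrete mechanism: cross-modal knowledge distillation from an audio teacher to a visual student. As a minimal, hypothetical sketch (names and weighting are assumptions, not the paper's implementation), the frame-level variant of such a distillation term can be written as a temperature-softened KL divergence between the frozen audio model's frame-wise distribution and the lipreading model's:

```python
import torch
import torch.nn.functional as F

def frame_level_kd(student_logits, teacher_logits, temperature=2.0):
    # Generic frame-level distillation term: KL divergence between the
    # temperature-softened teacher (audio) and student (visual) distributions.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2

# Usage sketch (all names hypothetical):
#   with torch.no_grad():
#       teacher_logits = audio_teacher(audio)   # frozen ASR model
#   student_logits = video_student(video)       # lipreading model
#   loss = supervised_loss + alpha * frame_level_kd(student_logits, teacher_logits)
```

A sequence-level term would instead score the student against the teacher's decoded transcriptions; either way, the aim is to let audio supervision shape the visual recognizer.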