Leveraging Modality-specific Representations for Audio-visual Speech
Recognition via Reinforcement Learning
- URL: http://arxiv.org/abs/2212.05301v1
- Date: Sat, 10 Dec 2022 14:01:54 GMT
- Title: Leveraging Modality-specific Representations for Audio-visual Speech
Recognition via Reinforcement Learning
- Authors: Chen Chen, Yuchen Hu, Qiang Zhang, Heqing Zou, Beier Zhu, and Eng
Siong Chng
- Abstract summary: We propose a reinforcement learning (RL) based framework called MSRL.
We customize a reward function directly related to task-specific metrics.
Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions.
- Score: 25.743503223389784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual speech recognition (AVSR) has gained remarkable success for
ameliorating the noise-robustness of speech recognition. Mainstream methods
focus on fusing audio and visual inputs to obtain modality-invariant
representations. However, such representations are prone to over-reliance on
audio modality as it is much easier to recognize than video modality in clean
conditions. As a result, the AVSR model underestimates the importance of visual
stream in face of noise corruption. To this end, we leverage visual
modality-specific representations to provide stable complementary information
for the AVSR task. Specifically, we propose a reinforcement learning (RL) based
framework called MSRL, where the agent dynamically harmonizes
modality-invariant and modality-specific representations in the auto-regressive
decoding process. We customize a reward function directly related to
task-specific metrics (i.e., word error rate), which encourages the MSRL to
effectively explore the optimal integration strategy. Experimental results on
the LRS3 dataset show that the proposed method achieves state-of-the-art in
both clean and various noisy conditions. Furthermore, we demonstrate the better
generality of MSRL system than other baselines when test set contains unseen
noises.
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for in-the-wild'' videos.
We first encode visual information into visual tokens sequence and map them into speech space by a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z) - Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - Multimodal Variational Auto-encoder based Audio-Visual Segmentation [46.67599800471001]
ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation.
Our approach leads to a new state-of-the-art for audio-visual segmentation, with a 3.84 mIOU performance leap.
arXiv Detail & Related papers (2023-10-12T13:09:40Z) - Lip2Vec: Efficient and Robust Visual Speech Recognition via
Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset achieving 26 WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z) - Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - Cross-Modal Global Interaction and Local Alignment for Audio-Visual
Speech Recognition [21.477900473255264]
We propose a cross-modal global interaction and local alignment (GILA) approach for audio-visual speech recognition (AVSR)
Specifically, we design a global interaction model to capture the A-V complementary relationship on modality level, as well as a local alignment approach to model the A-V temporal consistency on frame level.
Our GILA outperforms the supervised learning state-of-the-art on public benchmarks LRS3 and LRS2.
arXiv Detail & Related papers (2023-05-16T06:41:25Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework VATLM (Visual-Audio-Text Language Model)
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z) - Discriminative Multi-modality Speech Recognition [17.296404414250553]
Vision is often used as a complementary modality for audio speech recognition (ASR)
In this paper, we propose a two-stage speech recognition model.
In the first stage, the target voice is separated from background noises with help from the corresponding visual information of lip movements, making the model 'listen' clearly.
At the second stage, the audio modality combines visual modality again to better understand the speech by a MSR sub-network, further improving the recognition rate.
arXiv Detail & Related papers (2020-05-12T07:56:03Z) - How to Teach DNNs to Pay Attention to the Visual Modality in Speech
Recognition [10.74796391075403]
This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns.
We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern.
We propose a regularisation method which involves predicting lip-related Action Units from visual representations.
arXiv Detail & Related papers (2020-04-17T13:59:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.