Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2207.06020v1
- Date: Wed, 13 Jul 2022 08:07:19 GMT
- Title: Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition
- Authors: Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro
- Abstract summary: We propose the Visual Context-driven Audio Feature Enhancement module (V-CAFE) to enhance the input noisy audio speech with the help of audio-visual correspondence.
The proposed V-CAFE is designed to capture the transition of lip movements, namely the visual context, and to generate a noise reduction mask by considering the obtained visual context.
The effectiveness of the proposed method is evaluated in noisy speech recognition and overlapped speech recognition experiments using the two largest audio-visual datasets, LRS2 and LRS3.
- Score: 29.05833230733178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech
Recognition (AVSR) system. To this end, we propose the Visual Context-driven Audio
Feature Enhancement module (V-CAFE), which enhances the input noisy audio speech
with the help of audio-visual correspondence. The proposed V-CAFE is designed to
capture the transition of lip movements, namely the visual context, and to generate
a noise reduction mask by considering the obtained visual context. Through
context-dependent modeling, the ambiguity in viseme-to-phoneme mapping can be
refined for mask generation. The noisy representations are masked with the
noise reduction mask, resulting in enhanced audio features. The enhanced audio
features are fused with the visual features and fed to an encoder-decoder
model composed of Conformer and Transformer for speech recognition. We show that the
proposed end-to-end AVSR system with V-CAFE further improves the
noise robustness of AVSR. The effectiveness of the proposed method is evaluated
in noisy speech recognition and overlapped speech recognition experiments using
the two largest audio-visual datasets, LRS2 and LRS3.
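
To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation: the feature dimensions, the temporal convolution used to capture the visual context, and the cross-modal attention used for mask generation are all illustrative assumptions, and a plain Transformer encoder stands in for the Conformer encoder mentioned in the abstract.

```python
# Minimal sketch of the V-CAFE idea, assuming frame-synchronized audio and lip
# features of shape (batch, time, dim). All names, dimensions, and layer choices
# below are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class VCAFE(nn.Module):
    """Visual Context-driven Audio Feature Enhancement (illustrative sketch)."""

    def __init__(self, dim=256, context_kernel=5, num_heads=4):
        super().__init__()
        # Capture the transition of lip movements ("visual context") with a
        # temporal convolution over the visual feature sequence (assumption).
        self.visual_context = nn.Conv1d(dim, dim, kernel_size=context_kernel,
                                        padding=context_kernel // 2)
        # Let each audio frame attend to the visual context before predicting a
        # noise reduction mask; cross-modal attention is one plausible choice.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio, visual):
        # audio, visual: (B, T, dim), already temporally aligned
        ctx = self.visual_context(visual.transpose(1, 2)).transpose(1, 2)
        attended, _ = self.cross_attn(query=audio, key=ctx, value=ctx)
        mask = self.mask_head(torch.cat([audio, attended], dim=-1))  # values in [0, 1]
        return mask * audio  # enhanced (noise-suppressed) audio features


class AVSRSketch(nn.Module):
    """Wiring sketch: V-CAFE -> audio-visual fusion -> encoder-decoder.

    A plain Transformer encoder stands in for the Conformer encoder used in the
    paper; the decoder consumes already-embedded target tokens for brevity.
    """

    def __init__(self, dim=256, vocab_size=1000):
        super().__init__()
        self.vcafe = VCAFE(dim)
        self.fuse = nn.Linear(2 * dim, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, audio, visual, tgt_emb):
        enhanced = self.vcafe(audio, visual)                      # mask noisy audio
        fused = self.fuse(torch.cat([enhanced, visual], dim=-1))  # audio-visual fusion
        memory = self.encoder(fused)
        return self.classifier(self.decoder(tgt_emb, memory))     # token logits


# Example shapes: 2 clips, 100 frames, 256-dim features, 20 target positions.
audio = torch.randn(2, 100, 256)
visual = torch.randn(2, 100, 256)
tgt_emb = torch.randn(2, 20, 256)
logits = AVSRSketch()(audio, visual, tgt_emb)  # (2, 20, 1000)
```

The point this sketch tries to reflect is that the noise reduction mask is predicted from the lip-movement context rather than from the noisy audio alone, so the decision of what to suppress in the audio features is conditioned on the visual evidence described in the abstract.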
Related papers
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, leveraging the mixture-of-Experts for audioVisual ASR to perform robust speech recognition for "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition [27.58390468474957]
We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL).
AV-CPL is a semi-supervised method to train an audio-visual speech recognition model on a combination of labeled and unlabeled videos.
Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels.
arXiv Detail & Related papers (2023-09-29T16:57:21Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition [21.477900473255264]
We propose a noise-invariant visual modality to strengthen the robustness of AVSR.
Inspired by the human perception mechanism, we propose a universal viseme-phoneme mapping (UniVPM) approach to implement modality transfer.
Our approach achieves state-of-the-art performance under various noisy as well as clean conditions.
arXiv Detail & Related papers (2023-06-18T13:53:34Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Egocentric Audio-Visual Noise Suppression [11.113020254726292]
This paper studies audio-visual noise suppression for egocentric videos.
The video camera emulates the off-screen speaker's view of the outside world.
We first demonstrate that egocentric visual information is helpful for noise suppression.
arXiv Detail & Related papers (2022-11-07T15:53:12Z)
- AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
arXiv Detail & Related papers (2022-06-15T17:33:19Z)
- End-to-end multi-talker audio-visual ASR using an active speaker attention module [5.9698688193789335]
The paper presents a new approach for end-to-end audio-visual multi-talker speech recognition.
The approach, referred to here as the visual context attention model (VCAM), is important because it uses the available video information to assign decoded text to one of multiple visible faces.
arXiv Detail & Related papers (2022-04-01T18:42:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.