Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion
- URL: http://arxiv.org/abs/2508.18734v1
- Date: Tue, 26 Aug 2025 07:05:48 GMT
- Title: Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion
- Authors: DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang
- Abstract summary: We propose a novel framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT.
- Score: 46.072071890391356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
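As a rough illustration of the mechanism the abstract describes, here is a minimal PyTorch sketch of a router-gated fusion layer. The concat-plus-MLP router, the convex audio/visual mixing, and all shapes and module names are assumptions made for this example, not the authors' implementation.

```python
# Minimal sketch of router-gated cross-modal fusion, assuming a concat+MLP
# router and convex mixing of the two cross-attention outputs (illustrative
# choices, not the paper's implementation).
import torch
import torch.nn as nn


class RouterGatedFusionLayer(nn.Module):
    """One decoder layer that gates visual cross-attention by a
    token-level audio-corruption score."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Router: scores each audio token's reliability from the fused
        # audio-visual features (here, a simple concat + MLP).
        self.router = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid(),  # corruption score in [0, 1]
        )
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, audio, visual):
        # x: (B, T, D) decoder states; audio/visual: (B, T, D) encoder features.
        corruption = self.router(torch.cat([audio, visual], dim=-1))  # (B, T, 1)
        a, _ = self.audio_attn(x, audio, audio)
        v, _ = self.visual_attn(x, visual, visual)
        # Down-weight unreliable audio tokens, reinforce visual cues.
        fused = (1.0 - corruption) * a + corruption * v
        return self.norm(x + fused)


layer = RouterGatedFusionLayer()
x = torch.randn(2, 50, 512)
out = layer(x, torch.randn(2, 50, 512), torch.randn(2, 50, 512))
print(out.shape)  # torch.Size([2, 50, 512])
```

The key property is that the gate is computed per token, so the model can lean on lip cues only for the frames where the audio is actually corrupted.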
Related papers
- Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition [13.50064027453736]
High-noise audio inputs are prone to introducing adverse interference into the feature fusion process. We propose an end-to-end noise-robust AVSR framework coupled with speech enhancement. Our method preserves speech semantic integrity to achieve robust recognition performance.
arXiv Detail & Related papers (2026-01-18T14:46:08Z) - AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition [2.4842074869626396]
We introduce a new AVSR framework termed AD-AVSR based on bidirectional modality enhancement. Specifically, we first introduce the audio dual-stream encoding strategy to enrich audio representations from multiple perspectives. We adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs.
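As a hedged sketch of what such a threshold-based selection could look like, the snippet below keeps only token pairs whose audio and visual features are sufficiently similar; the cosine-similarity criterion and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
# Illustrative threshold-based selection of audio-visual pairs; the
# cosine-similarity criterion and threshold value are assumptions.
import torch
import torch.nn.functional as F


def select_av_pairs(audio_feats, visual_feats, threshold: float = 0.5):
    """audio_feats, visual_feats: (B, T, D). Returns a (B, T) boolean mask
    that is True where the audio and visual tokens are sufficiently
    correlated, so weakly related pairs can be excluded from fusion."""
    sim = F.cosine_similarity(audio_feats, visual_feats, dim=-1)  # (B, T)
    return sim >= threshold


mask = select_av_pairs(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
print(mask.shape, mask.dtype)  # torch.Size([2, 50]) torch.bool
```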
arXiv Detail & Related papers (2025-08-11T04:23:08Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens the representation of each modality by fusing them at multiple levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
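A rough sketch of fusing the two streams at several encoder depths via cross-attention, in the spirit of the summary above; the layer counts, dimensions, and the audio-queries-to-visual-keys direction are illustrative assumptions, not the paper's architecture.

```python
# Sketch of multi-layer cross-attention fusion: each encoder depth gets
# its own cross-attention block mixing the visual stream into the audio
# stream. All structural choices here are assumptions for illustration.
import torch
import torch.nn as nn


class MultiLayerCrossAttnFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 3):
        super().__init__()
        self.audio_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.visual_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One cross-attention block per depth, fusing the two streams.
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, audio, visual):
        for enc_a, enc_v, xattn in zip(self.audio_layers, self.visual_layers, self.cross):
            audio, visual = enc_a(audio), enc_v(visual)
            # Audio queries attend to visual keys/values at this depth.
            fused, _ = xattn(audio, visual, visual)
            audio = audio + fused
        return audio
```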
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition [21.477900473255264]
We propose a noise-invariant visual modality to strengthen the robustness of AVSR.
Inspired by human perception mechanism, we propose a universal viseme-phoneme mapping (UniVPM) approach to implement modality transfer.
Our approach achieves the state-of-the-art under various noisy as well as clean conditions.
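Viseme-phoneme mapping is many-to-one: several phonemes look identical on the lips. As a toy illustration (not UniVPM's learned mapping), a lookup of this kind shows why visual evidence alone narrows, but does not decide, the phoneme; the groupings below follow common phonetic convention.

```python
# Toy viseme-to-phoneme lookup illustrating the many-to-one structure
# that viseme-phoneme mapping approaches exploit. This is a hypothetical
# table, not UniVPM's learned mapping.
VISEME_TO_PHONEMES = {
    "bilabial": ["p", "b", "m"],     # lips pressed together
    "labiodental": ["f", "v"],       # lower lip against upper teeth
    "dental": ["th", "dh"],          # tongue tip between teeth
    "rounded": ["w", "uw", "ow"],    # rounded, protruded lips
}


def candidate_phonemes(viseme: str) -> list[str]:
    """Phonemes consistent with an observed viseme; acoustic or language
    context must disambiguate among them."""
    return VISEME_TO_PHONEMES.get(viseme, [])


print(candidate_phonemes("bilabial"))  # ['p', 'b', 'm']
```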
arXiv Detail & Related papers (2023-06-18T13:53:34Z) - Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition [10.74796391075403]
We present a variant of AV Align where the recurrent Long Short-term Memory (LSTM) block is replaced by the more recently proposed Transformer block.
We find that Transformers also learn cross-modal monotonic alignments, but suffer from the same visual convergence problems as the LSTM model.
arXiv Detail & Related papers (2020-05-19T09:06:39Z) - Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% absolute (26.83% relative) and 22.22% absolute (56.87% relative) word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
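Under the usual definitions (relative reduction = absolute reduction / baseline WER), the reported pairs pin down the implied baseline and improved WERs; a quick arithmetic check:

```python
# Sanity-checking the reported numbers: an absolute WER reduction divided
# by the corresponding relative reduction recovers the baseline WER.
for absolute, relative in [(6.81, 0.2683), (22.22, 0.5687)]:
    baseline = absolute / relative
    print(f"baseline WER ≈ {baseline:.2f}%, "
          f"improved WER ≈ {baseline - absolute:.2f}%")
# baseline WER ≈ 25.38%, improved WER ≈ 18.57%
# baseline WER ≈ 39.07%, improved WER ≈ 16.85%
```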
arXiv Detail & Related papers (2020-05-18T10:31:19Z)