Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion
- URL: http://arxiv.org/abs/2508.18734v1
- Date: Tue, 26 Aug 2025 07:05:48 GMT
- Title: Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion
- Authors: DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang
- Abstract summary: We propose a novel framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT.
- Score: 46.072071890391356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
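As a rough illustration of the mechanism the abstract describes, here is a minimal PyTorch sketch of a router-gated fusion layer. The concat-plus-MLP router, the convex audio/visual mixing, and all shapes and module names are assumptions made for this example, not the authors' implementation.

```python
# Minimal sketch of router-gated cross-modal fusion, assuming a concat+MLP
# router and convex mixing of the two cross-attention outputs (illustrative
# choices, not the paper's implementation).
import torch
import torch.nn as nn


class RouterGatedFusionLayer(nn.Module):
    """One decoder layer that gates visual cross-attention by a
    token-level audio-corruption score."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Router: scores each audio token's reliability from the fused
        # audio-visual features (here, a simple concat + MLP).
        self.router = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 1),
            nn.Sigmoid(),  # corruption score in [0, 1]
        )
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, audio, visual):
        # x: (B, T, D) decoder states; audio/visual: (B, T, D) encoder features.
        corruption = self.router(torch.cat([audio, visual], dim=-1))  # (B, T, 1)
        a, _ = self.audio_attn(x, audio, audio)
        v, _ = self.visual_attn(x, visual, visual)
        # Down-weight unreliable audio tokens, reinforce visual cues.
        fused = (1.0 - corruption) * a + corruption * v
        return self.norm(x + fused)


layer = RouterGatedFusionLayer()
x = torch.randn(2, 50, 512)
out = layer(x, torch.randn(2, 50, 512), torch.randn(2, 50, 512))
print(out.shape)  # torch.Size([2, 50, 512])
```

The key property is that the gate is computed per token, so the model can lean on lip cues only for the frames where the audio is actually corrupted.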
Related papers
- Purification Before Fusion: Toward Mask-Free Speech Enhancement for Robust Audio-Visual Speech Recognition [13.50064027453736]
High-noise audio inputs are prone to introducing adverse interference into the feature fusion process. We propose an end-to-end noise-robust AVSR framework coupled with speech enhancement. Our method preserves speech semantic integrity to achieve robust recognition performance.
arXiv Detail & Related papers (2026-01-18T14:46:08Z) - AD-AVSR: Asymmetric Dual-stream Enhancement for Robust Audio-Visual Speech Recognition [2.4842074869626396]
We introduce a new AVSR framework termed AD-AVSR based on bidirectional modality enhancement. Specifically, we first introduce the audio dual-stream encoding strategy to enrich audio representations from multiple perspectives. We adopt a threshold-based selection mechanism to filter out irrelevant or weakly correlated audio-visual pairs.
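As a hedged sketch of what such a threshold-based selection could look like, the snippet below keeps only token pairs whose audio and visual features are sufficiently similar; the cosine-similarity criterion and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
# Illustrative threshold-based selection of audio-visual pairs; the
# cosine-similarity criterion and threshold value are assumptions.
import torch
import torch.nn.functional as F


def select_av_pairs(audio_feats, visual_feats, threshold: float = 0.5):
    """audio_feats, visual_feats: (B, T, D). Returns a (B, T) boolean mask
    that is True where the audio and visual tokens are sufficiently
    correlated, so weakly related pairs can be excluded from fusion."""
    sim = F.cosine_similarity(audio_feats, visual_feats, dim=-1)  # (B, T)
    return sim >= threshold


mask = select_av_pairs(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
print(mask.shape, mask.dtype)  # torch.Size([2, 50]) torch.bool
```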
arXiv Detail & Related papers (2025-08-11T04:23:08Z) - AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that strengthens the representation of each modality by fusing them at multiple levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
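A rough sketch of fusing the two streams at several encoder depths via cross-attention, in the spirit of the summary above; the layer counts, dimensions, and the audio-queries-to-visual-keys direction are illustrative assumptions, not the paper's architecture.

```python
# Sketch of multi-layer cross-attention fusion: each encoder depth gets
# its own cross-attention block mixing the visual stream into the audio
# stream. All structural choices here are assumptions for illustration.
import torch
import torch.nn as nn


class MultiLayerCrossAttnFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 3):
        super().__init__()
        self.audio_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.visual_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One cross-attention block per depth, fusing the two streams.
        self.cross = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, audio, visual):
        for enc_a, enc_v, xattn in zip(self.audio_layers, self.visual_layers, self.cross):
            audio, visual = enc_a(audio), enc_v(visual)
            # Audio queries attend to visual keys/values at this depth.
            fused, _ = xattn(audio, visual, visual)
            audio = audio + fused
        return audio
```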
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition [21.477900473255264]
We propose a noise-invariant visual modality to strengthen the robustness of AVSR.
Inspired by human perception mechanism, we propose a universal viseme-phoneme mapping (UniVPM) approach to implement modality transfer.
Our approach achieves the state-of-the-art under various noisy as well as clean conditions.
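Viseme-phoneme mapping is many-to-one: several phonemes look identical on the lips. As a toy illustration (not UniVPM's learned mapping), a lookup of this kind shows why visual evidence alone narrows, but does not decide, the phoneme; the groupings below follow common phonetic convention.

```python
# Toy viseme-to-phoneme lookup illustrating the many-to-one structure
# that viseme-phoneme mapping approaches exploit. This is a hypothetical
# table, not UniVPM's learned mapping.
VISEME_TO_PHONEMES = {
    "bilabial": ["p", "b", "m"],     # lips pressed together
    "labiodental": ["f", "v"],       # lower lip against upper teeth
    "dental": ["th", "dh"],          # tongue tip between teeth
    "rounded": ["w", "uw", "ow"],    # rounded, protruded lips
}


def candidate_phonemes(viseme: str) -> list[str]:
    """Phonemes consistent with an observed viseme; acoustic or language
    context must disambiguate among them."""
    return VISEME_TO_PHONEMES.get(viseme, [])


print(candidate_phonemes("bilabial"))  # ['p', 'b', 'm']
```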
arXiv Detail & Related papers (2023-06-18T13:53:34Z) - Should we hard-code the recurrence concept or learn it instead? Exploring the Transformer architecture for Audio-Visual Speech Recognition [10.74796391075403]
We present a variant of AV Align where the recurrent Long Short-term Memory (LSTM) block is replaced by the more recently proposed Transformer block.
We find that Transformers also learn cross-modal monotonic alignments, but suffer from the same visual convergence problems as the LSTM model.
arXiv Detail & Related papers (2020-05-19T09:06:39Z) - Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% absolute (26.83% relative) and 22.22% absolute (56.87% relative) word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
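Under the usual definitions (relative reduction = absolute reduction / baseline WER), the reported pairs pin down the implied baseline and improved WERs; a quick arithmetic check:

```python
# Sanity-checking the reported numbers: an absolute WER reduction divided
# by the corresponding relative reduction recovers the baseline WER.
for absolute, relative in [(6.81, 0.2683), (22.22, 0.5687)]:
    baseline = absolute / relative
    print(f"baseline WER ≈ {baseline:.2f}%, "
          f"improved WER ≈ {baseline - absolute:.2f}%")
# baseline WER ≈ 25.38%, improved WER ≈ 18.57%
# baseline WER ≈ 39.07%, improved WER ≈ 16.85%
```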
arXiv Detail & Related papers (2020-05-18T10:31:19Z)