Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
- URL: http://arxiv.org/abs/2507.21448v2
- Date: Mon, 04 Aug 2025 16:20:43 GMT
- Title: Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations
- Authors: T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang
- Abstract summary: This paper presents a real-time audio-visual speech enhancement (AVSE) system, RAVEN. It isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. To our knowledge, this is the first open-source implementation of a real-time AVSE system.
- Score: 5.130705720747573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show that concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that runs on a standard computer CPU, and we provide a video demonstration and code repository. To our knowledge, this is the first open-source implementation of a real-time AVSE system.
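The abstract describes conditioning the enhancement network on concatenated AVSR and ASD visual embeddings. A minimal PyTorch sketch of that conditioning pattern follows; the module names, dimensions, and mask-based design are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: conditioning a mask-based enhancer on concatenated
# AVSR + ASD visual embeddings. All sizes are illustrative.
import torch
import torch.nn as nn

class AVSEConditioner(nn.Module):
    def __init__(self, avsr_dim=512, asd_dim=128, audio_dim=256):
        super().__init__()
        # project the concatenated visual embeddings into the audio feature space
        self.visual_proj = nn.Linear(avsr_dim + asd_dim, audio_dim)
        self.enhancer = nn.GRU(2 * audio_dim, audio_dim, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(audio_dim, audio_dim), nn.Sigmoid())

    def forward(self, audio_feats, avsr_emb, asd_emb):
        # audio_feats: (B, T, audio_dim); avsr_emb: (B, T, avsr_dim); asd_emb: (B, T, asd_dim)
        visual = self.visual_proj(torch.cat([avsr_emb, asd_emb], dim=-1))
        fused, _ = self.enhancer(torch.cat([audio_feats, visual], dim=-1))
        return audio_feats * self.mask_head(fused)  # masked (enhanced) audio features

x = torch.randn(2, 100, 256)
v_avsr, v_asd = torch.randn(2, 100, 512), torch.randn(2, 100, 128)
enhanced = AVSEConditioner()(x, v_avsr, v_asd)  # (2, 100, 256)
```

Dropping either embedding stream from the concatenation reproduces the single-embedding ablations the abstract compares (AVSR-only vs. AVSR+ASD).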
Related papers
- Enhancing Audiovisual Speech Recognition through Bifocal Preference Optimization [59.1277150358203]
We propose using a preference optimization strategy to improve speech recognition accuracy for real-world videos. First, we create preference data by simulating common errors that occur in AV-ASR from two focal points. Second, we propose BPO-AVASR, a Bifocal Preference Optimization method that improves AV-ASR models by leveraging both input-side and output-side preferences.
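The summary does not give the BPO-AVASR objective itself; as a hedged illustration, a generic DPO-style preference loss (the family of objectives preference-optimization methods like this typically use) over a preferred transcript and a simulated-error transcript might look like:

```python
# Hedged sketch of a generic preference-optimization (DPO-style) loss;
# the bifocal specifics (input-/output-side preference construction)
# are simplified away and not taken from the paper.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # log-probs of the correct vs. simulated-error transcript under the
    # trained policy and a frozen reference model
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

loss = preference_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
                       torch.tensor([-11.0]), torch.tensor([-13.0]))
```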
arXiv Detail & Related papers (2024-12-26T00:26:45Z) - Robust Active Speaker Detection in Noisy Environments [29.785749048315616]
We formulate a robust active speaker detection (rASD) problem in noisy environments.
Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance.
We propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features.
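As a rough illustration of "separation as guidance" (an assumed training setup, not the paper's code): features extracted from noisy audio are pulled toward noise-free features produced by an audio-visual speech separator.

```python
# Hypothetical sketch: use the separator's noise-free features as a
# regression target so the ASD audio branch learns noise-robust features.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_encoder = nn.Conv1d(64, 128, kernel_size=3, padding=1)  # trainable ASD audio branch

noisy_feats = audio_encoder(torch.randn(2, 64, 100))   # features from noisy audio
separated_feats = torch.randn(2, 128, 100).detach()    # stand-in for frozen separator output
guidance_loss = F.mse_loss(noisy_feats, separated_feats)  # pull noisy toward noise-free
```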
arXiv Detail & Related papers (2024-03-27T20:52:30Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
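A minimal sketch of what a bidirectional audio-visual bridge could look like, assuming cross-attention in both directions; the actual BAVD decoder is more elaborate.

```python
# Assumed realization of "bidirectional bridges": audio queries video
# and video queries audio, with residual connections.
import torch
import torch.nn as nn

class BidirectionalBridge(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # video -> audio
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio -> video

    def forward(self, audio, video):
        audio_out, _ = self.v2a(audio, video, video)  # audio attends to video
        video_out, _ = self.a2v(video, audio, audio)  # video attends to audio
        return audio + audio_out, video + video_out

a, v = torch.randn(2, 50, 256), torch.randn(2, 196, 256)
a2, v2 = BidirectionalBridge()(a, v)
```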
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
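The co-prediction strategy can be illustrated with a toy example: two encodings of the same sound source predict each other's representation, with stop-gradient on the targets. The sizes and the MSE objective are assumptions.

```python
# Toy sketch of self-supervised co-prediction between two representations
# of the same sound source; not AVPC's actual objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_a, enc_b = nn.Linear(128, 64), nn.Linear(128, 64)
pred_a2b, pred_b2a = nn.Linear(64, 64), nn.Linear(64, 64)

x = torch.randn(8, 128)            # features of one sound source
za, zb = enc_a(x), enc_b(x)
# symmetric co-prediction loss; detach() stops gradients through the targets
loss = (F.mse_loss(pred_a2b(za), zb.detach())
        + F.mse_loss(pred_b2a(zb), za.detach()))
```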
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that these components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
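A hedged sketch of the recipe as summarized: freeze the speech model, project visual features into its token space, and train only small adapters. The dimensions and layer choices below are illustrative, not AVFormer's actual architecture.

```python
# Assumed sketch of injecting visual tokens into a frozen speech encoder
# with a trainable projection and a lightweight residual adapter.
import torch
import torch.nn as nn

class FrozenASRWithVisualTokens(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        for p in self.audio_encoder.parameters():
            p.requires_grad = False              # the speech model stays frozen
        self.visual_proj = nn.Linear(768, dim)   # trainable visual-to-token projection
        self.adapter = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, audio_tokens, visual_feats):
        # prepend projected visual tokens to the audio token sequence
        tokens = torch.cat([self.visual_proj(visual_feats), audio_tokens], dim=1)
        h = self.audio_encoder(tokens)
        return h + self.adapter(h)               # only proj + adapter receive gradients

out = FrozenASRWithVisualTokens()(torch.randn(2, 100, 512), torch.randn(2, 4, 768))
```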
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
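The two-stage design can be sketched as follows: a transformer maps noisy audio-visual features to clean mel-spectrograms, and a neural vocoder converts mels to a waveform. The vocoder below is a toy placeholder; LA-VocE uses a trained neural vocoder.

```python
# Two-stage sketch matching the description above; all modules are
# stand-ins with assumed dimensions.
import torch
import torch.nn as nn

mel_predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80 + 256, nhead=4, batch_first=True), num_layers=2)
to_mel = nn.Linear(80 + 256, 80)
vocoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 160))  # placeholder

noisy_mel = torch.randn(1, 200, 80)    # noisy mel frames
lip_feats = torch.randn(1, 200, 256)   # visual (lip) features
# stage 1: predict clean mel-spectrogram from noisy audio-visual input
clean_mel = to_mel(mel_predictor(torch.cat([noisy_mel, lip_feats], dim=-1)))
# stage 2: vocode mels to samples (here 160 samples per frame, assumed)
waveform = vocoder(clean_mel).reshape(1, -1)
```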
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - Late Audio-Visual Fusion for In-The-Wild Speaker Diarization [33.0046568984949]
We propose an audio-visual diarization model which combines audio-only and visual-centric sub-systems via late fusion.
For audio, we show that an attractor-based end-to-end system (EEND-EDA) performs remarkably well when trained with our proposed recipe for a simulated proxy dataset.
We also propose an improved version, EEND-EDA++, that uses attention in decoding and a speaker recognition loss during training to better handle the larger number of speakers.
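Late fusion itself is simple once the two subsystems' speaker identities are aligned; a minimal sketch, with an assumed fusion weight:

```python
# Hypothetical late fusion of frame-level speaker-activity probabilities
# from the audio-only and visual-centric subsystems.
import torch

audio_probs = torch.rand(1, 500, 3)   # (batch, frames, speakers) from EEND-EDA
video_probs = torch.rand(1, 500, 3)   # from the visual-centric subsystem
fused = 0.6 * audio_probs + 0.4 * video_probs  # fusion weight is an assumption
diarization = fused > 0.5             # frame-level speaker activity decisions
```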
arXiv Detail & Related papers (2022-11-02T17:20:42Z) - Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition [29.05833230733178]
We propose a Visual Context-driven Audio Feature Enhancement module (V-CAFE) that enhances noisy input speech with the help of audio-visual correspondence.
The proposed V-CAFE is designed to capture the transition of lip movements (the visual context) and to generate a noise-reduction mask conditioned on that context.
The effectiveness of the proposed method is evaluated in noisy speech recognition and overlapped speech recognition experiments using the two largest audio-visual datasets, LRS2 and LRS3.
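A small sketch of the idea as summarized: a temporal convolution over lip features stands in for the visual-context encoder, and its output predicts a multiplicative noise-reduction mask over the audio features. Sizes and the conv design are assumptions.

```python
# Assumed sketch of a visual-context-driven noise-reduction mask.
import torch
import torch.nn as nn

class VisualContextMask(nn.Module):
    def __init__(self, vis_dim=256, aud_dim=256):
        super().__init__()
        # temporal conv over lip features approximates "transition of lip movements"
        self.context = nn.Conv1d(vis_dim, aud_dim, kernel_size=5, padding=2)
        self.mask = nn.Sigmoid()

    def forward(self, audio_feats, lip_feats):
        # audio_feats, lip_feats: (B, T, dim)
        ctx = self.context(lip_feats.transpose(1, 2)).transpose(1, 2)
        return audio_feats * self.mask(ctx)   # noise-reduced audio features

out = VisualContextMask()(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
```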
arXiv Detail & Related papers (2022-07-13T08:07:19Z) - AVATAR: Unconstrained Audiovisual Speech Recognition [75.17253531162608]
We propose a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) trained end-to-end from spectrograms and full-frame RGB.
We demonstrate the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
We also create a new, real-world test bed for AV-ASR called VisSpeech, which demonstrates the contribution of the visual modality under challenging audio conditions.
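A toy sketch of the overall shape of such a model (illustrative dimensions; not AVATAR's architecture): spectrogram frames and per-frame RGB features are embedded, concatenated along the time axis, and decoded with an encoder-decoder transformer.

```python
# Hypothetical end-to-end AV-ASR skeleton from spectrograms + RGB features.
import torch
import torch.nn as nn

class TinyAVASR(nn.Module):
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.spec_embed = nn.Linear(80, dim)    # 80-bin log-mel frames (assumed)
        self.rgb_embed = nn.Linear(2048, dim)   # per-frame RGB features (assumed)
        self.xf = nn.Transformer(dim, nhead=4, num_encoder_layers=2,
                                 num_decoder_layers=2, batch_first=True)
        self.tok_embed = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, spec, rgb, tokens):
        # concatenate both modalities into one source sequence
        src = torch.cat([self.spec_embed(spec), self.rgb_embed(rgb)], dim=1)
        return self.out(self.xf(src, self.tok_embed(tokens)))

logits = TinyAVASR()(torch.randn(1, 200, 80), torch.randn(1, 16, 2048),
                     torch.randint(0, 1000, (1, 20)))
```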
arXiv Detail & Related papers (2022-06-15T17:33:19Z) - AVA-AVD: Audio-visual Speaker Diarization in the Wild [26.97787596025907]
Existing audio-visual diarization datasets are mainly focused on indoor environments like meeting rooms or news studios.
We propose a novel Audio-Visual Relation Network (AVR-Net) which introduces an effective modality mask to capture discriminative information based on visibility.
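The modality mask can be illustrated simply: when a speaker's face is off-screen, zero the visual relation features so the network relies on audio alone. A hypothetical sketch:

```python
# Assumed illustration of a visibility-based modality mask.
import torch

audio_rel = torch.randn(4, 128)    # pairwise audio relation features
visual_rel = torch.randn(4, 128)   # pairwise visual relation features
visible = torch.tensor([1., 0., 1., 1.]).unsqueeze(1)  # face-visibility flags
# mask out visual evidence for off-screen speakers before fusing
relation = torch.cat([audio_rel, visible * visual_rel], dim=-1)
```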
arXiv Detail & Related papers (2021-11-29T11:02:41Z)