Related papers: Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

URL: http://arxiv.org/abs/2303.08536v2
Date: Mon, 20 Mar 2023 07:01:45 GMT
Title: Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring
Authors: Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Abstract summary: This paper addresses Audio-Visual Speech Recognition (AVSR) when the multimodal inputs are corrupted. In real life, clean visual inputs are not always available and can be corrupted by occluded lip regions or noise. We propose a novel AVSR framework, the Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to corrupted multimodal inputs.
Score: 29.05833230733178
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper deals with Audio-Visual Speech Recognition (AVSR) under multimodal input corruption situations where audio inputs and visual inputs are both corrupted, which is not well addressed in previous research directions. Previous studies have focused on how to complement the corrupted audio inputs with the clean visual inputs with the assumption of the availability of clean visual inputs. However, in real life, clean visual inputs are not always accessible and can even be corrupted by occluded lip regions or noises. Thus, we firstly analyze that the previous AVSR models are not indeed robust to the corruption of multimodal input streams, the audio and the visual inputs, compared to uni-modal models. Then, we design multimodal input corruption modeling to develop robust AVSR models. Lastly, we propose a novel AVSR framework, namely Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to the corrupted multimodal inputs. The AV-RelScore can determine which input modal stream is reliable or not for the prediction and also can exploit the more reliable streams in prediction. The effectiveness of the proposed method is evaluated with comprehensive experiments on popular benchmark databases, LRS2 and LRS3. We also show that the reliability scores obtained by AV-RelScore well reflect the degree of corruption and make the proposed model focus on the reliable multimodal representations.
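The abstract says that AV-RelScore decides which modal stream is reliable and leans on the more reliable one, but it does not say how the scores are computed or applied. The following is only a minimal sketch of the general idea of reliability-weighted fusion, not the paper's architecture. The function name, the per-frame logits (which in a real model would presumably come from a learned module), and the concatenation step are all illustrative assumptions.

```python
import math


def sigmoid(x):
    """Map a raw score to a reliability weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def reliability_weighted_fusion(audio_feats, visual_feats, audio_logits, visual_logits):
    """Hypothetical fusion step (assumption, not taken from the paper).

    Each argument holds one entry per frame. Every frame's audio and
    visual feature vectors are scaled by their own reliability weight
    and then concatenated, so a stream judged unreliable contributes
    little to the fused representation.
    """
    fused = []
    for a, v, la, lv in zip(audio_feats, visual_feats, audio_logits, visual_logits):
        score_a, score_v = sigmoid(la), sigmoid(lv)
        fused.append([score_a * x for x in a] + [score_v * x for x in v])
    return fused


# One frame: audio features [2.0, 4.0], visual features [6.0].
# A strongly negative visual logit mimics an occluded lip region.
frame = reliability_weighted_fusion([[2.0, 4.0]], [[6.0]], [0.0], [-50.0])
```

In this toy call the audio features are halved (a logit of 0 gives a weight of 0.5) while the visual feature is driven close to zero, which is the qualitative behaviour the abstract describes for corrupted streams.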

Related papers

ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection [49.14187862877009]
We present ERF-BA-TFD+, a novel deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. We evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips.
arXiv Detail & Related papers (2025-08-24T10:03:46Z)
AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding [14.515296731166721]
We propose Audio-Visual Contrastive Decoding (AVCD) to model trimodal interactions and suppress hallucinations in multimodal large language models (MLLMs). To improve efficiency, we introduce entropy-guided adaptive decoding, which skips unnecessary decoding steps based on the model's confidence in its predictions.
arXiv Detail & Related papers (2025-05-27T08:13:57Z)
AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR). MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation [23.406334722946163]
We propose CAV2vec, a novel self-supervised speech representation learning framework to handle audio-visual joint corruption. We suggest unimodal multi-task learning, which distills cross-modal knowledge and aligns the corrupted modalities. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy.
arXiv Detail & Related papers (2025-01-23T05:11:19Z)
What If the Input is Expanded in OOD Detection? [77.37433624869857]
Out-of-distribution (OOD) detection aims to identify OOD inputs from unknown classes. Various scoring functions have been proposed to distinguish them from in-distribution (ID) data. We introduce a novel perspective, i.e., employing different common corruptions on the input space.
arXiv Detail & Related papers (2024-10-24T06:47:28Z)
Learning Trimodal Relation for AVQA with Missing Modality [13.705369273831055]
We propose a framework that ensures robust Audio-Visual Question Answering (AVQA) performance even when a modality is missing. Our method can provide accurate answers by effectively utilizing available information even when input modalities are missing.
arXiv Detail & Related papers (2024-07-23T04:35:56Z)
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection [53.448283629898214]
The recent proliferation of hyper-realistic deepfake videos has drawn attention to the threat of audio and visual forgeries. Most previous work on detecting AI-generated fake videos utilizes only the visual modality or only the audio modality. We propose an Audio-Visual Transformer-based Ensemble Network (AVTENet) framework that considers both acoustic manipulation and visual manipulation.
arXiv Detail & Related papers (2023-10-19T19:01:26Z)
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations. We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks. We show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner. In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
A Multi-View Approach To Audio-Visual Speaker Verification [38.9710777250597]
In this study, we explore audio-visual approaches to speaker verification. We report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset. This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
arXiv Detail & Related papers (2021-02-11T22:29:25Z)
Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods. Our proposals improve signal to distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition [10.74796391075403]
This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns. We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern. We propose a regularisation method which involves predicting lip-related Action Units from visual representations.
arXiv Detail & Related papers (2020-04-17T13:59:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.