On Robustness to Missing Video for Audiovisual Speech Recognition
- URL: http://arxiv.org/abs/2312.10088v2
- Date: Tue, 19 Dec 2023 01:44:13 GMT
- Title: On Robustness to Missing Video for Audiovisual Speech Recognition
- Authors: Oscar Chang, Otavio Braga, Hank Liao, Dmitriy Serdyuk, Olivier Siohan
- Abstract summary: We show that missing video frames should not degrade the performance of an audiovisual model to be worse than that of a single-modality audio-only model.
We introduce a framework that allows claims about robustness to be evaluated in a precise and testable way.
- Score: 17.261450158359402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been shown that learning audiovisual features can lead to improved
speech recognition performance over audio-only features, especially for noisy
speech. However, in many common applications, the visual features are partially
or entirely missing, e.g.~the speaker might move off screen. Multi-modal models
need to be robust: missing video frames should not degrade the performance of
an audiovisual model to be worse than that of a single-modality audio-only
model. While there have been many attempts at building robust models, there is
little consensus on how robustness should be evaluated. To address this, we
introduce a framework that allows claims about robustness to be evaluated in a
precise and testable way. We also conduct a systematic empirical study of the
robustness of common audiovisual speech recognition architectures on a range of
acoustic noise conditions and test suites. Finally, we show that an
architecture-agnostic solution based on cascades can consistently achieve
robustness to missing video, even in settings where existing techniques for
robustness like dropout fall short.
Related papers
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z) - A Study of Dropout-Induced Modality Bias on Robustness to Missing Video
Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Multi-encoder attention-based architectures for sound recognition with
partial visual assistance [14.160670979300628]
We show that a multi-encoder framework can be employed to deal with this issue.
We show that the proposed model extension can successfully be utilized to incorporate partially available visual information.
arXiv Detail & Related papers (2022-09-26T16:32:33Z) - A Single Self-Supervised Model for Many Speech Modalities Enables
Zero-Shot Modality Transfer [31.028408352051684]
We present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech.
Our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
arXiv Detail & Related papers (2022-07-14T16:21:33Z) - Can audio-visual integration strengthen robustness under multimodal
attacks? [47.791552254215745]
We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning.
We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception.
For interpreting the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model.
arXiv Detail & Related papers (2021-04-05T16:46:45Z) - Audiovisual Saliency Prediction in Uncategorized Video Sequences based
on Audio-Video Correlation [0.0]
This work aims to provide a generic audio/video saliency model augmenting a visual saliency map with an audio saliency map computed by synchronizing low-level audio and visual features.
The proposed model was evaluated using different criteria against eye fixations data for a publicly available DIEM video dataset.
arXiv Detail & Related papers (2021-01-07T14:22:29Z) - VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device
Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as a 8-bit integer model and run in realtime.
arXiv Detail & Related papers (2020-09-09T14:26:56Z) - Audio Impairment Recognition Using a Correlation-Based Feature
Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.