A Study of Dropout-Induced Modality Bias on Robustness to Missing Video
Frames for Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2403.04245v1
- Date: Thu, 7 Mar 2024 06:06:55 GMT
- Title: A Study of Dropout-Induced Modality Bias on Robustness to Missing Video
Frames for Audio-Visual Speech Recognition
- Authors: Yusheng Dai, Hang Chen, Jun Du, Ruoyu Wang, Shihao Chen, Jiefeng Ma,
Haotian Wang, Chin-Hui Lee
- Abstract summary: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
- Score: 53.800937914403654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to
be sensitive to missing video frames, performing even worse than
single-modality models. While applying the dropout technique to the video
modality enhances robustness to missing frames, it simultaneously results in a
performance loss when dealing with complete data input. In this paper, we
investigate this contrasting phenomenon from the perspective of modality bias
and reveal that an excessive modality bias on the audio caused by dropout is
the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH)
to systematically describe the relationship between modality bias and
robustness against missing modality in multimodal systems. Building on these
findings, we propose a novel Multimodal Distribution Approximation with
Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio
modality and to maintain performance and robustness simultaneously. Finally, to
address an entirely missing modality, we adopt adapters to dynamically switch
decision strategies. The effectiveness of our proposed approach is evaluated
and validated through a series of comprehensive experiments using the MISP2021
and MISP2022 datasets. Our code is available at
https://github.com/dalision/ModalBiasAVSR
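The abstract above rests on two concrete mechanisms: dropout applied to the video modality to simulate missing frames, and knowledge distillation toward a multimodal target distribution. The following is a minimal PyTorch sketch of both ideas only; it is not the authors' MDA-KD implementation, and the function names, tensor shapes, and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch only; see https://github.com/dalision/ModalBiasAVSR for
# the authors' actual MDA-KD code. Shapes and names here are assumptions.
import torch
import torch.nn.functional as F


def video_modality_dropout(video_feats: torch.Tensor,
                           p_drop: float = 0.3,
                           training: bool = True) -> torch.Tensor:
    """Zero out whole video frames at random to simulate missing frames.

    video_feats: (batch, time, dim) frame-level visual features.
    """
    if not training or p_drop <= 0.0:
        return video_feats
    # One keep/drop decision per frame, shared across the feature dimension.
    keep = torch.rand(video_feats.shape[:2], device=video_feats.device) > p_drop
    return video_feats * keep.unsqueeze(-1).to(video_feats.dtype)


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence pulling the student toward a (frozen) multimodal teacher."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


# Hypothetical training step: the task loss is computed on dropout-corrupted
# video, while the KD term keeps the output distribution close to the one
# learned from complete audio-visual input.
# loss = task_loss + kd_weight * distillation_loss(student_logits, teacher_logits)
```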
Related papers
- Missingness-resilient Video-enhanced Multimodal Disfluency Detection [3.3281516035025285]
We present a practical multimodal disfluency detection approach that leverages available video data together with audio.
Our resilient design accommodates real-world scenarios where the video modality may sometimes be missing during inference.
In experiments across five disfluency-detection tasks, our unified multimodal approach significantly outperforms audio-only unimodal methods.
arXiv Detail & Related papers (2024-06-11T05:47:16Z) - Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent (see the sketch after this list).
Our method mitigates the performance loss, reducing it from its original ~30% drop to only ~10% when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z) - CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [21.380988939240844]
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio.
We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences.
arXiv Detail & Related papers (2023-12-08T23:55:19Z) - Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised
Audio-Visual Video Parsing [107.031903351176]
Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances and to identify the corresponding event categories, using only video-level category labels for training.
arXiv Detail & Related papers (2023-07-05T05:55:10Z) - Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical
Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z) - Self-attention fusion for audiovisual emotion recognition with
incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z) - Multi-Modal Perception Attention Network with Self-Supervised Learning
for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z) - Modality Compensation Network: Cross-Modal Adaptation for Action
Recognition [77.24983234113957]
We propose a Modality Compensation Network (MCN) to explore the relationships of different modalities.
Our model bridges data from source and auxiliary modalities by a modality adaptation block to achieve adaptive representation learning.
Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
arXiv Detail & Related papers (2020-01-31T04:51:55Z)
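As a companion to the Missing Modality Token (MMT) entry above, here is a minimal sketch of the general idea of substituting a learnable token for an absent modality's features. It is an assumption about the mechanism, not the MMT paper's implementation; the class and parameter names are hypothetical.

```python
# Assumed illustration of a missing-modality token, not the MMT paper's code.
from typing import Optional

import torch
import torch.nn as nn


class MissingModalityToken(nn.Module):
    """Stand-in embedding used when one modality's features are absent."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # A single trainable vector, tiled over batch and time when needed.
        self.token = nn.Parameter(torch.randn(feat_dim) * 0.02)

    def forward(self, feats: Optional[torch.Tensor],
                batch: int, time: int) -> torch.Tensor:
        if feats is not None:
            return feats  # modality present: pass features through unchanged
        # Modality missing: expand the token to the expected (batch, time, dim).
        return self.token.expand(batch, time, -1)
```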
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.