Can audio-visual integration strengthen robustness under multimodal
attacks?
- URL: http://arxiv.org/abs/2104.02000v1
- Date: Mon, 5 Apr 2021 16:46:45 GMT
- Title: Can audio-visual integration strengthen robustness under multimodal
attacks?
- Authors: Yapeng Tian and Chenliang Xu
- Abstract summary: We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning.
We attack audio, visual, and both modalities to explore whether audio-visual integration still strengthens perception.
For interpreting the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model.
- Score: 47.791552254215745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose to make a systematic study of machines' multisensory
perception under attacks. We use the audio-visual event recognition task
against multimodal adversarial attacks as a proxy to investigate the robustness
of audio-visual learning. We attack audio, visual, and both modalities to
explore whether audio-visual integration still strengthens perception and how
different fusion mechanisms affect the robustness of audio-visual models. For
interpreting the multimodal interactions under attacks, we learn a
weakly-supervised sound source visual localization model to localize sounding
regions in videos. To mitigate multimodal attacks, we propose an audio-visual
defense approach based on an audio-visual dissimilarity constraint and external
feature memory banks. Extensive experiments demonstrate that audio-visual
models are susceptible to multimodal adversarial attacks; audio-visual
integration can decrease model robustness rather than strengthen it under
multimodal attacks; even a weakly-supervised sound source visual localization
model can be successfully fooled; our defense method can improve the
invulnerability of audio-visual networks without significantly sacrificing
clean model performance.
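To make the attack setting concrete, the following is a minimal PGD-style sketch of a joint attack on both modalities, assuming an untargeted L-infinity threat model and a generic `model(audio, video)` event classifier. The interface, epsilon budgets, and step sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a joint audio-visual PGD attack (assumed details,
# not the authors' exact attack): perturb both modalities to flip the
# prediction of a generic model(audio, video) event classifier.
import torch
import torch.nn.functional as F

def multimodal_pgd(model, audio, video, label, eps_a=0.002, eps_v=8 / 255,
                   alpha_a=0.0005, alpha_v=2 / 255, steps=10):
    adv_a, adv_v = audio.clone().detach(), video.clone().detach()
    for _ in range(steps):
        adv_a.requires_grad_(True)
        adv_v.requires_grad_(True)
        loss = F.cross_entropy(model(adv_a, adv_v), label)
        grad_a, grad_v = torch.autograd.grad(loss, [adv_a, adv_v])
        with torch.no_grad():
            # ascend the loss in each modality, then project back into
            # the L-infinity balls around the clean inputs
            adv_a = audio + (adv_a + alpha_a * grad_a.sign() - audio).clamp(-eps_a, eps_a)
            adv_v = video + (adv_v + alpha_v * grad_v.sign() - video).clamp(-eps_v, eps_v)
            adv_v = adv_v.clamp(0.0, 1.0)  # keep frames in a valid pixel range
    return adv_a.detach(), adv_v.detach()
```

Attacking only `adv_a` or only `adv_v` recovers the single-modality settings the paper also studies.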
Related papers
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- On Robustness to Missing Video for Audiovisual Speech Recognition [17.261450158359402]
We show that missing video frames should not degrade the performance of an audiovisual model below that of a single-modality audio-only model.
We introduce a framework that allows claims about robustness to be evaluated in a precise and testable way.
arXiv Detail & Related papers (2023-12-13T05:32:52Z)
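As a rough illustration of that testable claim, one can compare an audio-visual model evaluated with its video stream removed against an audio-only baseline. Zeroing the frames below is just one simple way to simulate missing video, and the `model(audio, video)` interface is an assumption.

```python
# Hedged sketch of the robustness criterion: an AV model with missing
# video should score no worse than an audio-only baseline.
import torch

@torch.no_grad()
def accuracy(model, loader, drop_video=False):
    correct = total = 0
    for audio, video, label in loader:
        if drop_video:
            video = torch.zeros_like(video)  # simulate missing frames
        pred = model(audio, video).argmax(dim=-1)
        correct += (pred == label).sum().item()
        total += label.numel()
    return correct / total

def robust_to_missing_video(av_model, audio_only_acc, loader):
    # the testable claim: AV accuracy under missing video >= audio-only
    return accuracy(av_model, loader, drop_video=True) >= audio_only_acc
```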
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
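The co-prediction idea can be sketched as each modality predicting the other's embedding of the same sound source. The stop-gradient targets, cosine objective, and prediction heads here are assumptions in the spirit of common co-prediction setups, not necessarily AVPC's exact formulation.

```python
# Sketch of a symmetric co-prediction objective between two audio-visual
# representations of the same sound source (illustrative, not the exact
# AVPC loss).
import torch.nn.functional as F

def co_prediction_loss(z_audio, z_video, pred_a2v, pred_v2a):
    """z_*: (B, D) embeddings; pred_*: small prediction heads (e.g. MLPs)."""
    # stop-gradient on the target side so each branch predicts the other
    loss_a2v = 1 - F.cosine_similarity(pred_a2v(z_audio), z_video.detach()).mean()
    loss_v2a = 1 - F.cosine_similarity(pred_v2a(z_video), z_audio.detach()).mean()
    return 0.5 * (loss_a2v + loss_v2a)
```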
- Push-Pull: Characterizing the Adversarial Robustness for Audio-Visual Active Speaker Detection [88.74863771919445]
We reveal the vulnerability of AVASD models under audio-only, visual-only, and audio-visual adversarial attacks.
We also propose a novel audio-visual interaction loss (AVIL) that makes it difficult for attackers to find feasible adversarial examples.
arXiv Detail & Related papers (2022-10-03T08:10:12Z)
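A hedged sketch of the push-pull intuition: pull embeddings of synchronized audio-visual pairs together and push mismatched pairs apart, so feasible adversarial examples become harder to find. The margin form below is an assumption, not necessarily the exact AVIL loss.

```python
# Push-pull style audio-visual interaction loss (illustrative sketch):
# row i of each batch is a matching audio-visual pair; off-diagonal
# entries are mismatched pairs.
import torch
import torch.nn.functional as F

def interaction_loss(z_audio, z_video, margin=0.5):
    """z_audio, z_video: (B, D) L2-normalized embeddings."""
    sim = z_audio @ z_video.t()          # (B, B) cosine similarities
    pos = sim.diagonal()                 # matched pairs: pull together
    neg = sim - 1e9 * torch.eye(sim.size(0), device=sim.device)  # mask diagonal
    # hinge on the hardest mismatched pair: push apart by at least `margin`
    return F.relu(neg.max(dim=1).values - pos + margin).mean()
```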
- Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection [14.779452690026144]
We propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy for weakly-supervised audio-visual learning.
Our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset.
arXiv Detail & Related papers (2022-07-12T12:42:21Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
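For reference, the SDR metric cited above can be computed in its basic form as follows; this is a plain textbook definition, not the paper's exact evaluation code.

```python
# Basic signal-to-distortion ratio in dB for 1-D waveforms.
import torch

def sdr_db(estimate, reference, eps=1e-8):
    noise = estimate - reference
    power_ref = reference.pow(2).sum()
    power_noise = noise.pow(2).sum().clamp_min(eps)  # avoid division by zero
    return 10 * torch.log10(power_ref / power_noise + eps)
```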
- Adversarial attacks on audio source separation [26.717340178640498]
We reformulate various adversarial attack methods for the audio source separation problem.
We propose a simple yet effective regularization method to obtain imperceptible adversarial noise.
We also show the robustness of source separation models against a black-box attack.
arXiv Detail & Related papers (2020-10-07T05:02:21Z)
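The regularization idea can be sketched as optimizing a perturbation that degrades the separator's output while an added norm penalty keeps the noise small and thus hard to hear. The loss weights, optimizer, and `separator(mixture)` interface are illustrative assumptions.

```python
# Sketch of a regularized adversarial attack on a source separator:
# maximize separation error while penalizing perturbation energy so the
# adversarial noise stays (nearly) imperceptible.
import torch
import torch.nn.functional as F

def attack_separator(separator, mixture, clean_sources, steps=100,
                     lr=1e-3, reg_weight=10.0):
    delta = torch.zeros_like(mixture, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        estimates = separator(mixture + delta)
        sep_loss = -F.mse_loss(estimates, clean_sources)  # degrade separation
        reg = reg_weight * delta.pow(2).mean()            # keep noise small
        opt.zero_grad()
        (sep_loss + reg).backward()
        opt.step()
    return (mixture + delta).detach()
```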
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
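A minimal co-attention sketch, assuming standard multi-head attention where each modality's features attend over the other's; the dimensions and the use of `nn.MultiheadAttention` are illustrative choices, not the paper's exact architecture.

```python
# Co-attention between audio and visual feature sequences (illustrative).
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats, video_feats):
        """audio_feats: (B, Ta, D); video_feats: (B, Tv, D)."""
        # audio queries attend to video keys/values, and vice versa
        a_out, _ = self.a2v(audio_feats, video_feats, video_feats)
        v_out, _ = self.v2a(video_feats, audio_feats, audio_feats)
        return a_out, v_out
```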
- Curriculum Audiovisual Learning [113.20920928789867]
We present a flexible audiovisual model that introduces a soft-clustering module as the audio and visual content detector.
To ease the difficulty of audiovisual learning, we propose a novel learning strategy that trains the model from simple to complex scenes.
We show that our localization model significantly outperforms existing methods, and building on it we achieve comparable performance in sound separation without relying on external visual supervision.
arXiv Detail & Related papers (2020-01-26T07:08:47Z)
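The simple-to-complex strategy can be sketched as staging the training data by an easiness measure such as the number of sound sources per scene; the difficulty measure, prefix schedule, and the `train_clips`/`count_sources`/`train` names below are illustrative assumptions.

```python
# Curriculum sketch: train on a growing, easy-to-hard prefix of the data.
def curriculum_stages(clips, difficulty, num_stages=3):
    """clips: training clips; difficulty: clip -> number (e.g. source count)."""
    ordered = sorted(clips, key=difficulty)
    for stage in range(1, num_stages + 1):
        yield ordered[: len(ordered) * stage // num_stages]

# usage: for pool in curriculum_stages(train_clips, count_sources): train(pool)
```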
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.