Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
- URL: http://arxiv.org/abs/2311.08151v1
- Date: Tue, 14 Nov 2023 13:27:03 GMT
- Title: Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
- Authors: Yating Xu, Conghui Hu, Gim Hee Lee
- Abstract summary: We propose a messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion.
The messengers condense the full cross-modal context into a compact representation to only preserve useful cross-modal information.
We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction.
- Score: 58.9467115916639
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing works on weakly-supervised audio-visual video parsing adopt
the hybrid attention network (HAN) as the multi-modal embedding to capture
cross-modal context. HAN embeds the audio and visual modalities with a shared
network, where cross-attention is performed at the input. However, such an early fusion
method highly entangles the two non-fully correlated modalities and leads to
sub-optimal performance in detecting single-modality events. To deal with this
problem, we propose the messenger-guided mid-fusion transformer to reduce the
uncorrelated cross-modal context in the fusion. The messengers condense the
full cross-modal context into a compact representation to only preserve useful
cross-modal information. Furthermore, because microphones capture audio events
from all directions while cameras record visual events only within a restricted
field of view, unaligned cross-modal context from audio occurs more frequently
in visual event prediction. We thus propose cross-audio prediction consistency
to suppress the impact of irrelevant audio information on visual event
prediction. Experiments consistently demonstrate the superior performance of
our framework compared to existing state-of-the-art methods.
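The abstract describes the messenger mechanism only at a high level. As a rough illustration, below is a minimal, hedged sketch of what a messenger-guided mid-fusion layer could look like in PyTorch; the class and parameter names (MessengerFusionLayer, num_messengers, the initialization scale) are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a messenger-guided mid-fusion layer (PyTorch).
# Illustrative reconstruction from the abstract, not the authors' code:
# class/parameter names (MessengerFusionLayer, num_messengers, d_model) are
# assumptions. A few learnable messenger tokens first condense one modality's
# context via cross-attention; the other modality then attends only to those
# messengers instead of the full cross-modal sequence.
import torch
import torch.nn as nn


class MessengerFusionLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, num_messengers=4):
        super().__init__()
        # Learnable messenger tokens, one set per direction of transfer.
        self.msg_a2v = nn.Parameter(torch.randn(1, num_messengers, d_model) * 0.02)
        self.msg_v2a = nn.Parameter(torch.randn(1, num_messengers, d_model) * 0.02)
        self.condense_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.condense_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.inject_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.inject_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)

    def forward(self, audio, visual):
        # audio:  (B, Ta, d_model) uni-modal audio features
        # visual: (B, Tv, d_model) uni-modal visual features
        B = audio.size(0)
        # 1) Messengers condense the full cross-modal context into a few tokens.
        msg_a, _ = self.condense_a(self.msg_a2v.expand(B, -1, -1), audio, audio)
        msg_v, _ = self.condense_v(self.msg_v2a.expand(B, -1, -1), visual, visual)
        # 2) Each modality attends only to the other modality's compact
        #    messengers, so uncorrelated cross-modal context is filtered out.
        v_ctx, _ = self.inject_v(visual, msg_a, msg_a)
        a_ctx, _ = self.inject_a(audio, msg_v, msg_v)
        return self.norm_a(audio + a_ctx), self.norm_v(visual + v_ctx)


# Usage: fuse 10-segment audio/visual features of width 512.
audio = torch.randn(2, 10, 512)
visual = torch.randn(2, 10, 512)
a_out, v_out = MessengerFusionLayer()(audio, visual)
print(a_out.shape, v_out.shape)  # torch.Size([2, 10, 512]) twice
```

Because each modality sees only a handful of messenger tokens rather than the other modality's full sequence, the fusion bandwidth is deliberately narrow, which is one plausible reading of how condensing the cross-modal context into a compact representation suppresses uncorrelated information.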
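Cross-audio prediction consistency is likewise only named in the abstract. The following is a minimal sketch of one possible formulation, assuming the model returns per-segment visual event probabilities; the batch-rolling audio swap and the mean-squared-error penalty are illustrative assumptions rather than the paper's exact objective.

```python
# Hedged sketch of a cross-audio prediction consistency term, reconstructed
# from the abstract (the function name, the batch-rolling pairing strategy,
# and the MSE objective are assumptions). Idea: the visual event prediction
# should stay the same when a video's audio is replaced by audio from another
# video, so irrelevant audio context cannot sway the visual branch.
import torch
import torch.nn.functional as F


def cross_audio_consistency(model, audio, visual):
    # audio/visual: (B, T, D) segment features; model(audio, visual) is assumed
    # to return per-segment visual event probabilities of shape (B, T, C).
    p_own = model(audio, visual)
    # Pair each visual stream with another video's audio by rolling the batch.
    p_swap = model(audio.roll(shifts=1, dims=0), visual)
    # Penalize disagreement; the own-audio prediction acts as a fixed target.
    return F.mse_loss(p_swap, p_own.detach())


# Toy check: a model that ignores audio incurs zero consistency loss.
dummy = lambda a, v: torch.sigmoid(v.mean(dim=-1, keepdim=True).repeat(1, 1, 25))
print(cross_audio_consistency(dummy, torch.randn(4, 10, 512), torch.randn(4, 10, 512)))
# tensor(0.)
```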
Related papers
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representations separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z)
- CoLeaF: A Contrastive-Collaborative Learning Framework for Weakly Supervised Audio-Visual Video Parsing [23.100602876056165]
Weakly supervised audio-visual video parsing methods aim to detect audible-only, visible-only, and audible-visible events using only video-level labels.
We propose CoLeaF, a novel learning framework that optimizes the integration of cross-modal context in the embedding space.
Our experiments demonstrate that CoLeaF significantly improves the state-of-the-art results by an average of 1.9% and 2.4% F-score on the datasets.
arXiv Detail & Related papers (2024-05-17T10:51:15Z)
- Audio-Visual Speaker Verification via Joint Cross-Attention [4.229744884478575]
We propose cross-modal joint attention to fully leverage both the inter-modal complementary information and the intra-modal information for speaker verification.
We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification.
arXiv Detail & Related papers (2023-09-28T16:25:29Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn the above correlation and leverage it as semantic guidance.
arXiv Detail & Related papers (2021-08-26T13:11:48Z)
- Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning [141.38505371646482]
Cross-modal correlation provides an inherent supervision for video unsupervised representation learning.
This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property.
CMAC aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of the acoustic signal.
arXiv Detail & Related papers (2021-06-13T07:41:15Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)