Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning
- URL: http://arxiv.org/abs/2106.06939v1
- Date: Sun, 13 Jun 2021 07:41:15 GMT
- Title: Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning
- Authors: Shaobo Min, Qi Dai, Hongtao Xie, Chuang Gan, Yongdong Zhang, Jingdong
Wang
- Abstract summary: Cross-modal correlation provides an inherent supervision for video unsupervised representation learning.
This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property.
CMAC aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of the acoustic signal.
- Score: 141.38505371646482
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-modal correlation provides an inherent supervision for video
unsupervised representation learning. Existing methods focus on distinguishing
different video clips by visual and audio representations. Human visual
perception can attend to the regions where sounds are made, and auditory
perception can likewise ground the frequencies of sounding objects; we call
this bidirectional local correspondence. Such supervision is intuitive but not
well explored in the contrastive learning framework. This paper introduces a
pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the
bidirectional local correspondence property. The CMAC approach aims to align
the regional attention generated purely from the visual signal with the target
attention generated under the guidance of the acoustic signal, and to perform
a similar alignment for frequency grounding on the acoustic attention.
Accompanied by a
remoulded cross-modal contrastive loss where we consider additional
within-modal interactions, the CMAC approach effectively enforces
the bidirectional alignment. Extensive experiments on six downstream benchmarks
demonstrate that CMAC can improve the state-of-the-art performance on both
visual and audio modalities.
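The abstract describes two ingredients: a bidirectional attention-consistency alignment and a remoulded cross-modal contrastive loss with within-modal interactions. The sketch below illustrates one plausible reading of each in PyTorch; the function names, the KL-divergence consistency measure, and the InfoNCE-style loss with within-modal negatives are assumptions for illustration, not the authors' implementation.
```python
import torch
import torch.nn.functional as F

def attention_logits(feat_map, query):
    # feat_map: (B, C, H, W) visual features; query: (B, C) pooled vector.
    # Dot-product scores of the query over the H*W spatial positions.
    return torch.einsum('bc,bcn->bn', query, feat_map.flatten(2))

def attention_consistency(feat_map, self_query, cross_query):
    # One direction of the bidirectional alignment: attention produced from
    # the modality's own pooled vector is pulled toward attention produced
    # under the other modality's guidance. The cross-modal target is
    # detached so it acts as supervision. (KL divergence is an assumed
    # choice of consistency measure.)
    log_p = F.log_softmax(attention_logits(feat_map, self_query), dim=-1)
    q = F.softmax(attention_logits(feat_map, cross_query), dim=-1).detach()
    return F.kl_div(log_p, q, reduction='batchmean')

def contrastive_with_within_modal(v, a, tau=0.07):
    # A rough stand-in for the "remoulded" contrastive loss: standard
    # InfoNCE over matched (v_i, a_i) clip pairs, with within-modal
    # similarities of mismatched clips appended as extra negatives.
    v = F.normalize(v, dim=-1)
    a = F.normalize(a, dim=-1)
    B = v.size(0)
    cross = v @ a.t() / tau                  # (B, B) cross-modal similarities
    within = v @ v.t() / tau                 # within-modal (visual) similarities
    within.fill_diagonal_(float('-inf'))     # exclude self-similarity
    logits = torch.cat([cross, within], dim=1)
    labels = torch.arange(B, device=v.device)  # positive = matched pair
    return F.cross_entropy(logits, labels)
```
The frequency-grounding direction would apply attention_consistency symmetrically to the audio feature map (over frequency bins), with the visual vector serving as the cross-modal guidance.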
Related papers
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z) - Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual Segmentation has emerged, aiming to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet).
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z) - Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z) - Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video
Parsing [58.9467115916639]
We propose a messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion.
The messengers condense the full cross-modal context into a compact representation to only preserve useful cross-modal information.
We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction.
arXiv Detail & Related papers (2023-11-14T13:27:03Z) - CM-PIE: Cross-modal perception for interactive-enhanced audio-visual
video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z) - Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations
in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos.
Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations.
We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
arXiv Detail & Related papers (2021-10-20T14:45:13Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
Localization results from the previous iteration serve as pseudo-labels, which are used to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z) - Look, Listen, and Attend: Co-Attention Network for Self-Supervised
Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z) - How to Teach DNNs to Pay Attention to the Visual Modality in Speech
Recognition [10.74796391075403]
This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns.
We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern.
We propose a regularisation method which involves predicting lip-related Action Units from visual representations.
arXiv Detail & Related papers (2020-04-17T13:59:19Z)