Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization
- URL: http://arxiv.org/abs/2409.07967v2
- Date: Tue, 18 Feb 2025 16:22:14 GMT
- Title: Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization
- Authors: Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang
- Abstract summary: We present LoCo, a Locality-aware cross-modal Correspondence learning framework for Dense-localization Audio-Visual Events (DAVE).
LoCo applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local-correlated properties.
We further customize a Cross-modal Dynamic Perception (CDP) layer in the cross-modal feature pyramid to understand local temporal patterns of audio-visual events.
- Score: 50.122441710500055
- Abstract: Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video. Existing DAVE solutions extract audio and visual features through modality-specific encoders and fuse them via dense cross-attention. The independent processing of each modality neglects their complementarity, resulting in modality-specific noise, while dense attention fails to account for the local temporal continuity of events, causing irrelevant signal distractions. In this paper, we present LoCo, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal to guide the filtering of irrelevant information and to encourage the extraction of complementary multimodal information during both unimodal and cross-modal learning stages. i) Specifically, LoCo applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local-correlated properties without any extra annotations. This forces unimodal encoders to highlight similar semantics shared by audio and visual features. ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception (CDP) layer in the cross-modal feature pyramid to understand local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LoCo provides solid performance gains and outperforms existing DAVE methods.
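The abstract describes the mechanism only at a high level. As a rough illustration (not the authors' implementation), the minimal PyTorch-style sketch below shows one way locality-aware cross-modal correspondence could be computed: audio-visual similarity is restricted to a local temporal window so that nearby, event-coherent frames drive the correction of the unimodal features. The tensor shapes, the window size, and the residual gating are assumptions introduced for illustration only.
```python
# Minimal sketch (not the authors' code): locality-restricted audio-visual
# correspondence. Similarity is masked outside a temporal window so that
# local event continuity, rather than distant unrelated frames, guides the
# cross-modal correction of unimodal features.
import torch
import torch.nn.functional as F

def local_cross_modal_affinity(audio, visual, window=5):
    """audio, visual: (B, T, D) unimodal features projected to a shared dim."""
    a = F.normalize(audio, dim=-1)
    v = F.normalize(visual, dim=-1)
    sim = torch.einsum("btd,bsd->bts", a, v)            # (B, T, T) cosine similarities
    t = torch.arange(sim.size(1), device=sim.device)
    local = (t[None, :] - t[:, None]).abs() <= window   # keep only nearby time steps
    return sim.masked_fill(~local, float("-inf"))

def locality_gated_features(audio, visual, window=5):
    """Re-weight visual features by how strongly nearby audio frames agree with them."""
    aff = local_cross_modal_affinity(audio, visual, window)
    weights = aff.softmax(dim=-1)                        # (B, T, T), zero outside the window
    visual_ctx = torch.bmm(weights, visual)              # audio-guided local visual context
    return visual + visual_ctx                           # residual fusion (one possible choice)
```
For example, with audio and visual features of shape (2, 64, 256) and window=5, each visual time step is refreshed using only the audio frames within five steps of it; the residual connection is just one plausible way to inject the corrected context back into the unimodal stream.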
Related papers
- Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration [48.57159286673662]
This paper aims to advance audio-visual scene understanding for longer, untrimmed videos.
We introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration and the Multi-Temporal Granularity Collaboration.
Experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.
arXiv Detail & Related papers (2024-12-17T07:43:36Z) - CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z) - Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video
Parsing [58.9467115916639]
We propose a messenger-guided mid-fusion transformer to reduce the uncorrelated cross-modal context in the fusion.
The messengers condense the full cross-modal context into a compact representation to only preserve useful cross-modal information.
We thus propose cross-audio prediction consistency to suppress the impact of irrelevant audio information on visual event prediction.
arXiv Detail & Related papers (2023-11-14T13:27:03Z) - Space-Time Memory Network for Sounding Object Localization in Videos [40.45443192327351]
We propose a space-time memory network for sounding object localization in videos.
It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations.
arXiv Detail & Related papers (2021-11-10T04:40:12Z) - Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn such cross-modal correlations and leverage them as semantic guidance.
arXiv Detail & Related papers (2021-08-26T13:11:48Z) - Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning [141.38505371646482]
Cross-modal correlation provides an inherent supervision for video unsupervised representation learning.
This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property.
CMAC aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of acoustic signal.
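For intuition, the following is a hedged sketch of an attention-consistency objective in this spirit, not the paper's exact CMAC loss: attention over visual regions computed from the visual stream alone is pushed toward attention computed under audio guidance. The mean-pooled visual query, the dot-product attention, and the KL formulation are all assumptions made for illustration.
```python
# Hedged sketch of an attention-consistency objective: align visual-only
# attention over regions with audio-guided attention over the same regions.
import torch
import torch.nn.functional as F

def attention_consistency_loss(visual_feats, audio_feat):
    """visual_feats: (B, N, D) region features; audio_feat: (B, D) clip-level audio."""
    v_query = visual_feats.mean(dim=1)                             # visual-only query
    attn_v = torch.einsum("bd,bnd->bn", v_query, visual_feats)     # visual self-attention logits
    attn_a = torch.einsum("bd,bnd->bn", audio_feat, visual_feats)  # audio-guided attention logits
    p_v = F.log_softmax(attn_v, dim=-1)
    p_a = F.softmax(attn_a, dim=-1)
    return F.kl_div(p_v, p_a, reduction="batchmean")               # encourage the two to agree
```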
arXiv Detail & Related papers (2021-06-13T07:41:15Z) - Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)