Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
- URL: http://arxiv.org/abs/2503.12847v1
- Date: Mon, 17 Mar 2025 05:48:22 GMT
- Title: Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
- Authors: Chen Liu, Peike Li, Liying Yang, Dadong Wang, Lincheng Li, Xin Yu
- Abstract summary: Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. We propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues. UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state.
- Score: 26.399212357764576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. Most previous methods emphasize spatial or temporal multi-modal modeling, yet overlook challenges from ambiguous audio-visual correspondences such as nearby visually similar but acoustically different objects and frequent shifts in objects' sounding status. Consequently, they may struggle to reliably correlate audio and visual cues, leading to over- or under-segmentation. To address these limitations, we propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. Instead of indiscriminately correlating audio-visual cues through a global attention mechanism, AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues, effectively directing the model's attention to audio-relevant areas. Leveraging contrastive learning, AMA further distinguishes sounding regions from silent areas by treating features with strong audio responses as positive samples and weaker responses as negatives. Additionally, UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state, reducing prediction errors by lowering confidence in these areas. Experimental results demonstrate that our approach achieves superior accuracy compared to existing state-of-the-art methods, particularly in challenging scenarios where traditional approaches struggle to maintain reliable segmentation.
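Since no reference implementation is given here, the following is a minimal PyTorch sketch of how the two components described in the abstract could be wired up. All class and function names, tensor shapes, the grouping scheme, and the positive/negative selection rule are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code) of the AMA idea: grouped audio-visual
# interaction, audio-responsiveness-weighted consolidation, and a contrastive
# term that treats the most audio-responsive group as the positive sample.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioGuidedModalityAlignment(nn.Module):
    def __init__(self, dim: int = 256, num_groups: int = 4):
        super().__init__()
        self.num_groups = num_groups
        # one cross-attention block per group: visual tokens attend to the audio cue
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
             for _ in range(num_groups)]
        )
        self.score = nn.Linear(dim, 1)  # audio-responsiveness score per group

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # visual: (B, N, C) patch tokens; audio: (B, 1, C) audio embedding
        groups = visual.chunk(self.num_groups, dim=1)
        pooled, scores = [], []
        for g, attn in zip(groups, self.cross_attn):
            out, _ = attn(query=g, key=audio, value=audio)   # audio-visual interaction
            pooled.append(out.mean(dim=1))                   # compact group feature (B, C)
            scores.append(self.score(pooled[-1]))            # responsiveness (B, 1)
        pooled = torch.stack(pooled, dim=1)                  # (B, G, C)
        weights = F.softmax(torch.cat(scores, dim=1), dim=1) # (B, G)
        # consolidate groups into one audio-weighted representation
        fused = (weights.unsqueeze(-1) * pooled).sum(dim=1)  # (B, C)
        return pooled, fused, weights


def ama_contrastive_loss(pooled, audio, weights, tau: float = 0.07):
    # pooled: (B, G, C), audio: (B, C), weights: (B, G)
    # strongest-responding group = positive, remaining groups = negatives
    sim = F.cosine_similarity(pooled, audio.unsqueeze(1), dim=-1) / tau  # (B, G)
    positive = weights.argmax(dim=1)                                      # (B,)
    return F.cross_entropy(sim, positive)
```

For the UE module, the abstract only states that spatial and temporal information are combined to flag regions whose sounding state changes frequently and to lower confidence there. One plausible reading, sketched below under that assumption, is per-pixel entropy plus temporal variance of the predicted masks.

```python
def spatio_temporal_uncertainty(mask_logits: torch.Tensor) -> torch.Tensor:
    # mask_logits: (B, T, 1, H, W) per-frame segmentation logits.
    # Uncertainty = per-pixel entropy (spatial) + variance across frames (temporal);
    # this formulation is assumed for illustration only.
    p = mask_logits.sigmoid().clamp(1e-6, 1 - 1e-6)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())        # (B, T, 1, H, W)
    temporal_var = p.var(dim=1, keepdim=True).expand_as(p)    # high where state flips
    return entropy + temporal_var
```

In a training loop one would typically down-weight the segmentation loss at high-uncertainty pixels (e.g., scaling by a decreasing function of the uncertainty map); the paper's exact weighting scheme is not specified in the abstract.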
Related papers
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.
Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR).
MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules.
To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z)
- Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics [26.399212357764576]
We propose Dynamic Derivation and Elimination (DDESeg), a novel audio-visual segmentation framework. To mitigate feature confusion, DDESeg reconstructs the semantic content of the mixed audio signal. To reduce the matching difficulty, we introduce a discriminative feature learning module.
arXiv Detail & Related papers (2025-03-17T05:38:05Z)
- Do Audio-Visual Segmentation Models Truly Segment Sounding Objects? [38.98706069359109]
We introduce AVSBench-Robust, a benchmark incorporating diverse negative audio scenarios including silence, ambient noise, and off-screen sounds. Our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-perfect false positive rates.
arXiv Detail & Related papers (2025-02-01T07:40:29Z)
- Unveiling and Mitigating Bias in Audio Visual Segmentation [9.427676046134374]
Community researchers have developed a range of advanced audio-visual segmentation models to improve the quality of sounding objects' masks.
While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic.
We attribute this to inherent real-world preferences and distributions, which serve as a simpler learning signal than complex audio-visual grounding.
arXiv Detail & Related papers (2024-07-23T16:55:04Z)
- Progressive Confident Masking Attention Network for Audio-Visual Segmentation [7.864898315909104]
A challenging problem known as Audio-Visual Segmentation (AVS) has emerged, aiming to produce segmentation maps for sounding objects within a scene. We introduce a novel Progressive Confident Masking Attention Network (PMCANet) that leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition [47.103732403296654]
The multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces.
We introduce a global-to-local quantization mechanism, which distills knowledge from stable global (clip-level) features into local (frame-level) ones.
Experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance.
arXiv Detail & Related papers (2023-09-29T20:48:44Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.