Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation
- URL: http://arxiv.org/abs/2509.22740v1
- Date: Fri, 26 Sep 2025 02:31:17 GMT
- Title: Learning What To Hear: Boosting Sound-Source Association For Robust Audiovisual Instance Segmentation
- Authors: Jinbae Seo, Hyeongjun Kwon, Kwonyoung Kim, Jiyoung Lee, Kwanghoon Sohn
- Abstract summary: Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, and visual-only training objectives allow queries to converge to arbitrary salient objects. We propose Audio-Centric Query Generation using cross-attention, enabling each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding.
- Score: 37.91678426119673
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, while visual-only training objectives allow queries to converge to arbitrary salient objects. We propose Audio-Centric Query Generation using cross-attention, enabling each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding. Additionally, we introduce a Sound-Aware Ordinal Counting (SAOC) loss that explicitly supervises the number of sounding objects through ordinal regression with monotonic consistency constraints, preventing visual-only convergence during training. Experiments on the AVISeg benchmark demonstrate consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.
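The abstract does not give the exact SAOC formulation, but the description (ordinal regression over sounding-object counts with a monotonic consistency constraint) maps onto the standard cumulative-link encoding, where the k-th output predicts "at least k sounding objects" and those probabilities must be non-increasing in k. The sketch below illustrates that general recipe in plain Python; the function names, the soft monotonicity penalty, and the `lam` weight are illustrative assumptions, not the paper's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_targets(n_sounding, max_count):
    """Cumulative 'at least k' encoding: targets[k-1] = 1 iff n_sounding >= k."""
    return [1.0 if n_sounding >= k else 0.0 for k in range(1, max_count + 1)]


def saoc_loss(logits, n_sounding, lam=1.0, eps=1e-9):
    """Sketch of an ordinal counting loss with a monotonic-consistency penalty.

    logits[k] is a raw score for the event 'at least k+1 sounding objects'.
    """
    probs = [sigmoid(z) for z in logits]
    targets = ordinal_targets(n_sounding, len(logits))
    # Binary cross-entropy per ordinal threshold (ordinal regression term).
    bce = -sum(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps)
               for t, p in zip(targets, probs)) / len(logits)
    # Monotonic consistency: 'at least k+1' cannot be more likely than
    # 'at least k', so penalize any increase along the threshold axis.
    mono = sum(max(probs[k + 1] - probs[k], 0.0) for k in range(len(probs) - 1))
    return bce + lam * mono


def predicted_count(logits):
    """Predicted object count = number of ordinal thresholds passed."""
    return sum(1 for z in logits if sigmoid(z) > 0.5)
```

Under this encoding, decreasing logits such as `[3.0, 2.0, -1.0, -4.0]` decode to a count of 2, and the monotonicity penalty is zero; only predictions whose threshold probabilities increase with k are penalized.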
Related papers
- Revisiting Audio-Visual Segmentation with Vision-Centric Transformer [60.83798235788669]
Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. We propose a new Vision-Centric Transformer framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information. Our framework achieves new state-of-the-art performance on three subsets of the AVSBench dataset.
arXiv Detail & Related papers (2025-06-30T08:40:36Z) - Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event Localization problem. This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference. We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z) - Progressive Confident Masking Attention Network for Audio-Visual Segmentation [7.864898315909104]
A challenging problem known as Audio-Visual Segmentation (AVS) has emerged, intending to produce segmentation maps for sounding objects within a scene. We introduce a novel Progressive Confident Masking Attention Network (PMCANet). It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z) - Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z) - Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z) - Play It Back: Iterative Attention for Audio Recognition [104.628661890361]
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
We propose an end-to-end attention-based architecture that through selective repetition attends over the most discriminative sounds.
We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
arXiv Detail & Related papers (2022-10-20T15:03:22Z) - AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces [28.245662058349854]
In this paper, we explore if audio-guided coarse head-pose can further enhance visual attention estimation performance for non-prolific faces.
We use off-the-shelf state-of-the-art models to facilitate cross-modal weak-supervision.
Our model can utilize any of the available modalities for task-specific inference.
arXiv Detail & Related papers (2022-07-07T02:23:02Z) - Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z) - Audiovisual transfer learning for audio tagging and sound event detection [21.574781022415372]
We study the merit of transfer learning for two sound recognition problems, i.e., audio tagging and sound event detection.
We adapt a baseline system utilizing only spectral acoustic inputs to make use of pretrained auditory and visual features.
We perform experiments with these modified models on an audiovisual multi-label data set.
arXiv Detail & Related papers (2021-06-09T21:55:05Z) - Positive Sample Propagation along the Audio-Visual Event Line [29.25572713908162]
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs).
We propose a new positive sample propagation (PSP) module to discover and exploit closely related audio-visual pairs.
We perform extensive experiments on the public AVE dataset and achieve new state-of-the-art accuracy in both fully and weakly supervised settings.
arXiv Detail & Related papers (2021-04-01T03:53:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.