Improving Sound Source Localization with Joint Slot Attention on Image and Audio
- URL: http://arxiv.org/abs/2504.15118v1
- Date: Mon, 21 Apr 2025 14:16:46 GMT
- Title: Improving Sound Source Localization with Joint Slot Attention on Image and Audio
- Authors: Inho Kim, Youngkil Song, Jicheol Park, Won Hwa Kim, Suha Kwak
- Abstract summary: Sound source localization (SSL) is the task of locating the source of sound within an image. Previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding. We present a novel SSL method that addresses this chronic issue by joint slot attention on image and audio.
- Score: 24.922273090257264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound source localization (SSL) is the task of locating the source of sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and audio as a single embedding vector each, and use them to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal due to the presence of noise and background irrelevant to the actual target in the input. We present a novel SSL method that addresses this chronic issue by joint slot attention on image and audio. To be specific, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only the target representations of image and audio are used for contrastive learning. Also, we introduce cross-modal attention matching to further align local features of image and audio. Our method achieved the best results in almost all settings on three public benchmarks for SSL, and substantially outperformed all prior work in cross-modal retrieval.
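The two-slot decomposition and the contrastive objective on target slots can be illustrated with a short sketch. The snippet below is only a minimal rendering of the idea described in the abstract, not the authors' released implementation: the feature dimension, iteration count, GRU-based slot update, and the names `TwoSlotAttention` and `target_contrastive_loss` are assumptions; what it does reflect is the two slots competing over local features (softmax across the slot axis) and InfoNCE applied only to the target slot of each modality.

```python
# Minimal sketch of joint two-slot attention + contrastive loss on target slots.
# All hyperparameters and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoSlotAttention(nn.Module):
    """Two slots (target / off-target) compete over a set of local features."""
    def __init__(self, dim=512, iters=3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        # Learnable initial slots: slot 0 ~ target, slot 1 ~ off-target.
        self.slots_init = nn.Parameter(torch.randn(2, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, feats):                        # feats: (B, N, D) local features
        B = feats.size(0)
        slots = self.slots_init.unsqueeze(0).expand(B, -1, -1).contiguous()
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.iters):
            q = self.to_q(slots)                     # (B, 2, D)
            logits = torch.einsum('bsd,bnd->bsn', q, k) * self.scale
            # Softmax over the slot axis: the two slots compete for each feature.
            attn = logits.softmax(dim=1) + 1e-8
            attn = attn / attn.sum(dim=-1, keepdim=True)
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots.reshape(-1, slots.size(-1))).view(B, 2, -1)
        return slots, attn                           # attn: (B, 2, N) per-slot attention maps

def target_contrastive_loss(img_slots, aud_slots, tau=0.07):
    """Symmetric InfoNCE between the *target* slots of paired image and audio."""
    zi = F.normalize(img_slots[:, 0], dim=-1)        # (B, D) image target slot
    za = F.normalize(aud_slots[:, 0], dim=-1)        # (B, D) audio target slot
    logits = zi @ za.t() / tau                       # (B, B) pairwise similarities
    labels = torch.arange(zi.size(0), device=zi.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

In this sketch the module would be applied separately to local image features and local audio features; only the first slot of each modality feeds the contrastive loss, while the target slot's attention map over image features could serve as the localization map. The cross-modal attention matching mentioned in the abstract is not shown here.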
Related papers
- Multi-scale Multi-instance Visual Sound Localization and Segmentation [15.624453757710802]
We propose a novel multi-scale visual sound localization framework, namely M2VSL.
M2VSL learns multi-scale semantic features associated with sound sources from the input image to localize sounding objects.
We conduct extensive experiments on VGGSound-Instruments, VGG-Sound Sources, and AVSBench benchmarks.
arXiv Detail & Related papers (2024-08-31T15:43:22Z)
- Unveiling Visual Biases in Audio-Visual Localization Benchmarks [52.76903182540441]
We identify a significant issue in existing benchmarks.
The sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias.
Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
arXiv Detail & Related papers (2024-08-25T04:56:08Z)
- Can CLIP Help Sound Source Localization? [19.370071553914954]
We introduce a framework that translates audio signals into tokens compatible with CLIP's text encoder.
By directly using these embeddings, our method generates audio-grounded masks for the provided audio.
Our findings suggest that utilizing pre-trained image-text models enables our model to generate more complete and compact localization maps for the sounding objects.
arXiv Detail & Related papers (2023-11-07T15:26:57Z)
- Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
arXiv Detail & Related papers (2023-09-19T16:04:50Z)
- A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition [26.828874753756523]
We propose a unified audio-visual learning framework (dubbed OneAVM) that integrates audio and visual cues for joint localization, separation, and recognition.
OneAVM comprises a shared audio-visual encoder and task-specific decoders trained with three objectives.
Experiments on MUSIC, VGG-Instruments, VGG-Music, and VGGSound datasets demonstrate the effectiveness of OneAVM for all three tasks.
arXiv Detail & Related papers (2023-05-30T23:53:12Z)
- LISA: Localized Image Stylization with Audio via Implicit Neural Representation [17.672008998994816]
We present a novel framework, Localized Image Stylization with Audio (LISA).
LISA performs audio-driven localized image stylization.
We show that the proposed framework outperforms the other audio-guided stylization methods.
arXiv Detail & Related papers (2022-11-21T11:51:48Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.