Sound Source Localization is All about Cross-Modal Alignment
- URL: http://arxiv.org/abs/2309.10724v1
- Date: Tue, 19 Sep 2023 16:04:50 GMT
- Title: Sound Source Localization is All about Cross-Modal Alignment
- Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung
- Abstract summary: Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
- Score: 53.957081836232206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans can easily perceive the direction of sound sources in a visual scene, a task termed sound source localization. Recent studies on learning-based sound source localization have approached the problem mainly from a localization perspective. However, prior work and existing benchmarks overlook a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization. Cross-modal semantic understanding matters when handling semantically mismatched audio-visual events, e.g., silent objects or off-screen sounds. To account for this, we propose a cross-modal alignment task jointly with sound source localization to better learn the interaction between the audio and visual modalities. In this way, we achieve high localization performance together with strong cross-modal semantic understanding. Our method outperforms state-of-the-art approaches in both sound source localization and cross-modal retrieval. Our work suggests that jointly tackling both tasks is necessary for genuine sound source localization.
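To make the joint formulation concrete, here is a minimal PyTorch sketch, an illustration of the general idea rather than the authors' implementation: an audio clip embedding is compared against spatial visual features to form a localization heatmap, and an attention-pooled visual vector is aligned with the audio embedding through a symmetric InfoNCE loss over the batch. Encoder outputs, shapes, and the pooling scheme are all assumptions.

```python
import torch
import torch.nn.functional as F

def joint_ssl_alignment_losses(vis_feats, aud_feats, temperature=0.07):
    """Sketch of a joint localization + cross-modal alignment objective.
    vis_feats: (B, C, H, W) spatial features from a visual encoder (assumed).
    aud_feats: (B, C)       clip-level features from an audio encoder (assumed).
    """
    B, C, H, W = vis_feats.shape
    vis = F.normalize(vis_feats, dim=1)   # unit-norm per spatial location
    aud = F.normalize(aud_feats, dim=1)   # unit-norm per audio clip

    # Localization: cosine similarity between the audio embedding and every
    # visual location yields a (B, H, W) sound-source heatmap.
    heatmap = torch.einsum('bchw,bc->bhw', vis, aud)

    # Pool the most audio-correlated locations into one visual vector per clip.
    weights = F.softmax(heatmap.flatten(1) / temperature, dim=1)    # (B, H*W)
    vis_tokens = vis.flatten(2).transpose(1, 2)                     # (B, H*W, C)
    vis_pooled = torch.einsum('bnc,bn->bc', vis_tokens, weights)    # (B, C)
    vis_pooled = F.normalize(vis_pooled, dim=1)

    # Cross-modal alignment: symmetric InfoNCE over the batch; matched
    # audio-visual pairs (the diagonal) are positives, the rest negatives.
    logits = vis_pooled @ aud.t() / temperature                     # (B, B)
    targets = torch.arange(B, device=logits.device)
    align_loss = 0.5 * (F.cross_entropy(logits, targets)
                        + F.cross_entropy(logits.t(), targets))
    return heatmap, align_loss
```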
Related papers
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z) - Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment [50.92136296059296]
Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events.
New benchmarks and evaluation metrics reveal previously overlooked issues in sound source localization studies.
This work provides the most comprehensive analysis of sound source localization to date.
arXiv Detail & Related papers (2024-07-18T16:51:15Z) - T-VSL: Text-Guided Visual Sound Source Localization in Mixtures [33.28678401737415]
We develop a framework to disentangle audio-visual source correspondence from multi-source mixtures.
Our framework exhibits promising zero-shot transferability to unseen classes during test time.
Experiments conducted on the MUSIC, VGGSound, and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-04-02T09:07:05Z) - Audio-Visual Spatial Integration and Recursive Attention for Robust
Sound Source Localization [13.278494654137138]
Humans utilize both audio and visual modalities as spatial cues to locate sound sources.
We propose an audio-visual spatial integration network that integrates spatial cues from both modalities.
Our method can perform more robust sound source localization.
arXiv Detail & Related papers (2023-08-11T11:57:58Z) - FlowGrad: Using Motion for Visual Sound Source Localization [22.5799820040774]
This work introduces temporal context into state-of-the-art methods for sound source localization in urban scenes, using optical flow to encode motion information.
An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.
arXiv Detail & Related papers (2022-11-15T18:12:10Z) - Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source
Localization [11.059590443280726]
Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research.
In a video, oftentimes, the objects exhibiting movement are the ones generating the sound.
In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source.
arXiv Detail & Related papers (2022-11-06T03:48:45Z) - Visual Sound Localization in the Wild by Cross-Modal Interference
Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z) - Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
Pseudo-labels derived from the predictions of the previous iteration are then used to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio (a schematic sketch of one refinement round follows after this list).
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.