Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
- URL: http://arxiv.org/abs/2211.03019v1
- Date: Sun, 6 Nov 2022 03:48:45 GMT
- Title: Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
- Authors: Dennis Fedorishin, Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, Venu Govindaraju
- Abstract summary: Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research.
In a video, oftentimes, the objects exhibiting movement are the ones generating the sound.
In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source.
- Score: 11.059590443280726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research. Existing work in this area focuses on creating attention maps to capture the correlation between the two modalities to localize the source of the sound. In a video, oftentimes, the objects exhibiting movement are the ones generating the sound. In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source. We further demonstrate that the addition of flow-based attention substantially improves visual sound source localization. Finally, we benchmark our method on standard sound source localization datasets and achieve state-of-the-art performance on the SoundNet-Flickr and VGG Sound Source datasets. Code: https://github.com/denfed/heartheflow.
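To make the idea concrete, here is a minimal sketch (in PyTorch) of how an optical-flow prior can modulate an audio-visual attention map. This is an illustration under assumptions, not the authors' implementation: the feature extractors are taken as given, and fusing the similarity map with the flow prior by elementwise product is just one plausible choice; see the linked repository for the actual method.

```python
# Minimal sketch (NOT the paper's implementation): an optical-flow prior
# modulating a cross-modal audio-visual similarity map.
import torch
import torch.nn.functional as F

def flow_prior_localization(audio_emb, visual_feat, flow):
    """
    audio_emb:   (B, C)        global audio embedding
    visual_feat: (B, C, H, W)  visual feature map
    flow:        (B, 2, H, W)  optical flow resized to the feature grid
    Returns a (B, H, W) localization map.
    """
    # Cosine similarity between the audio embedding and every spatial location.
    a = F.normalize(audio_emb, dim=1)                 # (B, C)
    v = F.normalize(visual_feat, dim=1)               # (B, C, H, W)
    av_sim = torch.einsum("bc,bchw->bhw", a, v)       # (B, H, W)

    # Flow prior: softmax over flow magnitudes emphasizes moving regions.
    mag = flow.norm(dim=1)                            # (B, H, W)
    b, h, w = mag.shape
    flow_attn = F.softmax(mag.view(b, -1), dim=1).view(b, h, w)

    # Fuse audio-visual similarity with the motion prior (one simple choice).
    return av_sim * flow_attn
```

A thresholded version of the returned map can then be compared against annotated bounding boxes using metrics such as cIoU and AUC, the measures reported on the benchmarks above.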
Related papers
- Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
arXiv Detail & Related papers (2023-09-19T16:04:50Z)
- Audio-Visual Spatial Integration and Recursive Attention for Robust Sound Source Localization [13.278494654137138]
Humans utilize both audio and visual modalities as spatial cues to locate sound sources.
We propose an audio-visual spatial integration network that integrates spatial cues from both modalities.
Our method can perform more robust sound source localization.
arXiv Detail & Related papers (2023-08-11T11:57:58Z)
- FlowGrad: Using Motion for Visual Sound Source Localization [22.5799820040774]
This work introduces temporal context into state-of-the-art methods for sound source localization in urban scenes, using optical flow to encode motion information (see the flow-computation sketch after this list).
An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.
arXiv Detail & Related papers (2022-11-15T18:12:10Z)
- Self-Supervised Predictive Learning: A Negative-Free Method for Sound Source Localization in Visual Scenes [91.59435809457659]
Self-Supervised Predictive Learning (SSPL) is a negative-free method for sound localization via explicit positive mining.
SSPL achieves significant improvements of 8.6% cIoU and 3.4% AUC on SoundNet-Flickr compared to the previous best.
arXiv Detail & Related papers (2022-03-25T01:42:42Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audios are usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Visually Guided Sound Source Separation and Localization using Self-Supervised Motion Representations [16.447597767676655]
We aim to pinpoint the source location in the input video sequence.
Recent works have shown impressive audio-visual separation results when using prior knowledge of the source type.
We propose a two-stage architecture, called Appearance and Motion network (AMnet), where the stages specialise to appearance and motion cues.
arXiv Detail & Related papers (2021-04-17T10:09:15Z)
- Multiple Sound Sources Localization from Coarse to Fine [41.56420350529494]
How to visually localize multiple sound sources in unconstrained videos is a formidable problem.
We develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes.
Our model achieves state-of-the-art results on a public sound source localization dataset.
arXiv Detail & Related papers (2020-07-13T12:59:40Z)
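As referenced from the FlowGrad entry above, both that work and the main paper rely on dense optical flow as a motion representation. One common way to obtain such a flow field is OpenCV's Farneback estimator; this is an illustrative choice, not necessarily the estimator either paper uses.

```python
# Illustrative only: dense optical flow between two consecutive frames
# using OpenCV's Farneback method. Neither paper necessarily uses this
# particular estimator; learned estimators (e.g. RAFT) are also common.
import cv2
import numpy as np

def dense_flow(prev_bgr: np.ndarray, next_bgr: np.ndarray) -> np.ndarray:
    """Return an (H, W, 2) flow field mapping pixels in prev to next."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Positional args: pyr_scale=0.5, levels=3, winsize=15, iterations=3,
    #                  poly_n=5, poly_sigma=1.2, flags=0
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```

The per-pixel magnitude of this field, np.linalg.norm(flow, axis=-1), is the kind of motion signal that can serve as a localization prior, as in the sketch under the abstract above.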
This list is automatically generated from the titles and abstracts of the papers on this site.