FlowGrad: Using Motion for Visual Sound Source Localization
- URL: http://arxiv.org/abs/2211.08367v2
- Date: Fri, 14 Apr 2023 18:14:19 GMT
- Title: FlowGrad: Using Motion for Visual Sound Source Localization
- Authors: Rajsuryan Singh, Pablo Zinemanas, Xavier Serra, Juan Pablo Bello,
Magdalena Fuentes
- Abstract summary: This work introduces temporal context into the state-of-the-art methods for sound source localization in urban scenes using optical flow as a means to encode motion information.
An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.
- Score: 22.5799820040774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most recent work in visual sound source localization relies on semantic
audio-visual representations learned in a self-supervised manner, and by design
excludes temporal information present in videos. While it proves to be
effective for widely used benchmark datasets, the method falls short for
challenging scenarios like urban traffic. This work introduces temporal context
into the state-of-the-art methods for sound source localization in urban scenes
using optical flow as a means to encode motion information. An analysis of the
strengths and weaknesses of our methods helps us better understand the problem
of visual sound source localization and sheds light on open challenges for
audio-visual scene understanding.
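The abstract describes combining a self-supervised audio-visual localization map with motion information encoded as optical flow. Below is a minimal illustrative sketch of that idea, not the paper's exact FlowGrad formulation: it assumes a localization heatmap (av_heatmap) produced by any self-supervised audio-visual model, and uses OpenCV's Farneback dense optical flow as the motion prior; all function and parameter names are hypothetical.

```python
import cv2
import numpy as np

def motion_saliency(prev_frame, next_frame):
    """Normalized dense optical flow magnitude, used as a motion prior."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Farneback dense flow: one (dx, dy) displacement vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    magnitude = np.linalg.norm(flow, axis=-1).astype(np.float32)
    return magnitude / (magnitude.max() + 1e-8)

def flow_weighted_localization(av_heatmap, prev_frame, next_frame):
    """Re-weight an audio-visual localization heatmap by motion saliency."""
    motion = motion_saliency(prev_frame, next_frame)
    # Match the motion map to the heatmap resolution before combining.
    motion = cv2.resize(motion, (av_heatmap.shape[1], av_heatmap.shape[0]))
    combined = av_heatmap * motion
    return combined / (combined.max() + 1e-8)
```

In practice the combination could be a soft weighting (e.g. a convex combination of the heatmap and the motion map) rather than a hard product, since sounding objects are not always moving; this sketch only illustrates how optical flow can inject temporal context into a per-frame localization map.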
Related papers
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- Sound Source Localization is All about Cross-Modal Alignment [53.957081836232206]
Cross-modal semantic understanding is essential for genuine sound source localization.
We propose a joint task with sound source localization to better learn the interaction between audio and visual modalities.
Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval.
arXiv Detail & Related papers (2023-09-19T16:04:50Z)
- Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization [11.059590443280726]
Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research.
In a video, oftentimes, the objects exhibiting movement are the ones generating the sound.
In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source.
arXiv Detail & Related papers (2022-11-06T03:48:45Z)
- OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context [58.932717614439916]
We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-02-10T10:50:52Z)
- Space-Time Memory Network for Sounding Object Localization in Videos [40.45443192327351]
We propose a space-time memory network for sounding object localization in videos.
It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations.
arXiv Detail & Related papers (2021-11-10T04:40:12Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- A Review of Sound Source Localization with Deep Learning Methods [71.18444724397486]
This article is a review on deep learning methods for single and multiple sound source localization.
We provide an exhaustive topography of the neural-based localization literature in this context.
Tables summarizing the literature review are provided at the end of the review for a quick search of methods with a given set of target characteristics.
arXiv Detail & Related papers (2021-09-08T07:25:39Z)
- Contrastive Learning of Global and Local Audio-Visual Representations [25.557229705149577]
We propose a versatile self-supervised approach to learn audio-visual representations that generalizes to tasks that require global semantic information.
We show that our approach learns generalizable video representations on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.
arXiv Detail & Related papers (2021-04-07T07:35:08Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Do We Need Sound for Sound Source Localization? [12.512982702508669]
We develop an unsupervised learning system that solves sound source localization.
We show that visual information is dominant in "sound" source localization when evaluated with the currently adopted benchmark dataset.
We present an evaluation protocol that enforces both visual and aural information to be leveraged.
arXiv Detail & Related papers (2020-07-11T08:57:58Z)