Self-supervised Neural Audio-Visual Sound Source Localization via
Probabilistic Spatial Modeling
- URL: http://arxiv.org/abs/2007.13976v1
- Date: Tue, 28 Jul 2020 03:52:53 GMT
- Title: Self-supervised Neural Audio-Visual Sound Source Localization via
Probabilistic Spatial Modeling
- Authors: Yoshiki Masuyama, Yoshiaki Bando, Kohei Yatabe, Yoko Sasaki, Masaki
Onishi, Yasuhiro Oikawa
- Abstract summary: This paper presents a self-supervised training method using 360° images and multichannel audio signals.
By incorporating the spatial information in multichannel audio signals, our method trains deep neural networks (DNNs) to distinguish multiple sound source objects.
We also demonstrate that the visual DNN detected objects, including talking visitors and specific exhibits, in real data recorded in a science museum.
- Score: 45.20508569656558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting sound source objects within visual observation is important for
autonomous robots to comprehend surrounding environments. Since sounding
objects in our living environments vary widely in appearance, labeling all
sounding objects is impossible in practice. This calls for self-supervised
learning, which does not require manual labeling. Most conventional
self-supervised learning methods use monaural audio signals and images and
cannot distinguish sound source objects with similar appearances due to the
poor spatial information in monaural signals. To solve this problem, this
paper presents a self-supervised training method using 360° images and
multichannel audio signals. By incorporating the spatial information in
multichannel audio signals, our method trains deep neural networks (DNNs) to
distinguish multiple sound source objects. Our system for localizing sound
source objects in the image is composed of audio and visual DNNs. The visual
DNN is trained to localize sound source candidates within an input image. The
audio DNN verifies whether each candidate actually produces sound or not. These
DNNs are jointly trained in a self-supervised manner based on a probabilistic
spatial audio model. Experimental results with simulated data showed that the
DNNs trained by our method localized multiple speakers. We also demonstrate
that the visual DNN detected objects, including talking visitors and specific
exhibits, in real data recorded in a science museum.
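As a rough illustration of the two-DNN system described in the abstract, the sketch below pairs a visual DNN that proposes sound-source candidates in a 360° image with an audio DNN that verifies each candidate against multichannel audio features. The module names, tensor shapes, and feature choices are assumptions made for illustration; this is a minimal sketch, not the authors' implementation, and the probabilistic spatial audio model used for joint training is only hinted at in the final comment.

```python
# Minimal sketch of the two-DNN pipeline (illustrative assumptions only).
import torch
import torch.nn as nn

class VisualCandidateNet(nn.Module):
    """Proposes per-pixel sound-source candidates in a 360° (equirectangular) image."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),
        )

    def forward(self, image):                        # image: (B, 3, H, W)
        return torch.sigmoid(self.backbone(image))   # (B, 1, H, W) candidate heatmap

class AudioVerifierNet(nn.Module):
    """Scores whether a visual candidate direction actually emits sound, given
    spatial features from the multichannel audio (e.g., inter-channel phase
    and level cues)."""
    def __init__(self, n_audio_feat=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_audio_feat + 1, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, audio_feat, candidate_azimuth):
        # audio_feat: (B, n_audio_feat), candidate_azimuth: (B, 1) in radians
        x = torch.cat([audio_feat, candidate_azimuth], dim=-1)
        return torch.sigmoid(self.mlp(x))            # probability the candidate is sounding

# In joint self-supervised training, a probabilistic spatial audio model ties
# the two outputs together (e.g., by treating the visual heatmap as a prior
# over source directions), so no manual labels are needed.
```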
Related papers
- DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection [16.92604848450722]
This paper describes sound event localization and detection (SELD) for spatial audio recordings captured by first-order ambisonics (FOA) microphones.
We propose a novel method of pretraining the feature extraction part of the deep neural network (DNN) in a self-supervised manner.
arXiv Detail & Related papers (2024-10-30T08:31:58Z)
- Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes [69.03289331433874]
We present an end-to-end audio rendering approach (Listen2Scene) for virtual reality (VR) and augmented reality (AR) applications.
We propose a novel neural-network-based sound propagation method to generate acoustic effects for 3D models of real environments.
arXiv Detail & Related papers (2023-02-02T04:09:23Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Telling Left from Right: Learning Spatial Correspondence of Sight and Sound [16.99266133458188]
We propose a novel self-supervised task to leverage a principle: matching spatial information in the audio stream to the positions of sound sources in the visual stream.
We train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams.
We demonstrate that understanding spatial correspondence enables models to perform better on three audio-visual tasks, achieving quantitative gains over supervised and self-supervised baselines (a minimal sketch of this flip-detection task appears after this list).
arXiv Detail & Related papers (2020-06-11T04:00:24Z)
- Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space [2.8935588665357077]
This paper builds on key ideas to develop perception of touch sounds without access to any ground-truth data.
We show how ideas from classical signal processing can be leveraged to obtain large amounts of data for any sound of interest with high precision.
arXiv Detail & Related papers (2020-02-10T20:33:25Z)
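For the flip-detection pretext task from "Telling Left from Right" above, a minimal sketch is given below: a classifier receives a downsampled video frame together with a stereo spectrogram and predicts whether the left and right audio channels were swapped. The network sizes, feature shapes, and batch construction are illustrative assumptions rather than the paper's actual setup.

```python
# Minimal sketch of the left/right flip-detection pretext task (assumptions only).
import torch
import torch.nn as nn

class FlipClassifier(nn.Module):
    """Predicts whether the stereo channels were swapped relative to the video."""
    def __init__(self, audio_dim=128, video_dim=128, n_freq=257, img_size=32):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(2 * n_freq, audio_dim), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(3 * img_size * img_size, video_dim), nn.ReLU())
        self.head = nn.Linear(audio_dim + video_dim, 1)

    def forward(self, stereo_spec, frame):
        # stereo_spec: (B, 2, n_freq) magnitude spectra of the left/right channels
        # frame: (B, 3, img_size, img_size) downsampled video frame
        a = self.audio_enc(stereo_spec.flatten(1))
        v = self.video_enc(frame.flatten(1))
        return self.head(torch.cat([a, v], dim=-1))   # logit: 1 = channels flipped

def make_flip_batch(stereo_spec, frame):
    """Builds positives (original channel order) and negatives (left/right swapped)."""
    swapped = stereo_spec.flip(dims=[1])              # swap the two audio channels
    audio = torch.cat([stereo_spec, swapped], dim=0)
    video = torch.cat([frame, frame], dim=0)
    labels = torch.cat([torch.zeros(len(frame)), torch.ones(len(frame))])
    return audio, video, labels

# Training would minimize a binary cross-entropy loss between the classifier
# logits and the labels, forcing the model to relate inter-channel audio cues
# to the horizontal layout of the visual scene.
```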
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.