Mix and Localize: Localizing Sound Sources in Mixtures
- URL: http://arxiv.org/abs/2211.15058v1
- Date: Mon, 28 Nov 2022 04:30:50 GMT
- Title: Mix and Localize: Localizing Sound Sources in Mixtures
- Authors: Xixi Hu, Ziyang Chen, Andrew Owens
- Abstract summary: We present a method for simultaneously localizing multiple sound sources within a visual scene.
Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al.
We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds.
- Score: 10.21507741240426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method for simultaneously localizing multiple sound sources
within a visual scene. This task requires a model to both group a sound mixture
into individual sources, and to associate them with a visual signal. Our method
jointly solves both tasks at once, using a formulation inspired by the
contrastive random walk of Jabri et al. We create a graph in which images and
separated sounds correspond to nodes, and train a random walker to transition
between nodes from different modalities with high return probability. The
transition probabilities for this walk are determined by an audio-visual
similarity metric that is learned by our model. We show through experiments
with musical instruments and human speech that our model can successfully
localize multiple sounds, outperforming other self-supervised methods. Project
site: https://hxixixh.github.io/mix-and-localize
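To make the formulation concrete, here is a minimal sketch of the two-step contrastive random walk objective, assuming hypothetical image-node and sound-node embedding tensors produced by learned encoders: transition probabilities come from a softmax over cross-modal similarities, and the loss rewards walks that return to their starting node. This is an illustration of the idea, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def random_walk_loss(img_emb, snd_emb, temperature=0.07):
    """Cycle-consistency loss for a two-step walk: images -> sounds -> images.

    img_emb: (N, D) embeddings of image nodes (illustrative names).
    snd_emb: (M, D) embeddings of separated-sound nodes.
    """
    img = F.normalize(img_emb, dim=-1)
    snd = F.normalize(snd_emb, dim=-1)

    sim = img @ snd.t() / temperature          # (N, M) audio-visual similarity
    p_img_to_snd = sim.softmax(dim=1)          # transition probs image -> sound
    p_snd_to_img = sim.t().softmax(dim=1)      # transition probs sound -> image

    p_return = p_img_to_snd @ p_snd_to_img     # (N, N) round-trip probabilities
    target = torch.arange(img.size(0))
    # Encourage the walker to return to the node it started from.
    return F.nll_loss(torch.log(p_return + 1e-8), target)

# Toy usage with random embeddings standing in for encoder outputs.
loss = random_walk_loss(torch.randn(8, 128), torch.randn(8, 128))
```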
Related papers
- wav2pos: Sound Source Localization using Masked Autoencoders [12.306126455995603]
We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem.
We show that such a formulation allows for accurate localization of the sound source, by reconstructing coordinates masked in the input.
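A toy sketch of the set-to-set, masked-coordinate idea, under assumed shapes and names (microphones and the source as transformer tokens); wav2pos's actual architecture and training differ.

```python
import torch
import torch.nn as nn

class MaskedCoordRegressor(nn.Module):
    """Toy set-to-set sketch: each microphone is a token (features + xyz);
    the source token has its coordinates masked and must be reconstructed."""
    def __init__(self, feat_dim=64, d_model=128):
        super().__init__()
        self.embed = nn.Linear(feat_dim + 3, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.coord_head = nn.Linear(d_model, 3)

    def forward(self, feats, coords, mask):
        # feats: (B, T, feat_dim) audio features per token (mics + source slot)
        # coords: (B, T, 3) known positions; mask: (B, T) True where coords hidden
        coords = coords.masked_fill(mask.unsqueeze(-1), 0.0)
        tokens = self.embed(torch.cat([feats, coords], dim=-1))
        return self.coord_head(self.encoder(tokens))   # predicted xyz per token

model = MaskedCoordRegressor()
feats, coords = torch.randn(2, 5, 64), torch.randn(2, 5, 3)
mask = torch.zeros(2, 5, dtype=torch.bool); mask[:, -1] = True  # hide source coords
pred = model(feats, coords, mask)   # supervise predictions at the masked slots
```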
arXiv Detail & Related papers (2024-08-28T13:09:20Z)
- Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation [26.867430697990674]
We use images and sounds that undergo subtle but geometrically consistent changes as we rotate our heads to jointly estimate camera rotation and localize sound sources.
A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from sounds.
We train these models to generate predictions that agree with one another.
Our model can successfully estimate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches.
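A minimal sketch of the agreement objective, assuming a single yaw angle in radians for both the predicted rotation and the predicted sound directions; the paper's actual parameterization may differ.

```python
import torch

def rotation_agreement_loss(pred_rot, pred_dir_t0, pred_dir_t1):
    """Consistency sketch (assumed 1-D yaw, radians): if the camera rotates by
    pred_rot between two instants, the sound's direction in the camera frame
    should shift by -pred_rot."""
    expected_dir_t1 = pred_dir_t0 - pred_rot
    # Wrap the angular difference to (-pi, pi] before penalizing it.
    diff = torch.atan2(torch.sin(pred_dir_t1 - expected_dir_t1),
                       torch.cos(pred_dir_t1 - expected_dir_t1))
    return (diff ** 2).mean()

# pred_rot from a visual model, pred_dir_* from an audio model (placeholders here).
loss = rotation_agreement_loss(torch.tensor([0.1]),
                               torch.tensor([0.5]), torch.tensor([0.4]))
```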
arXiv Detail & Related papers (2023-03-20T17:59:55Z)
- Multi-Source Diffusion Models for Simultaneous Music Generation and Separation [17.124189082882395]
We train our model on Slakh2100, a standard dataset for musical source separation.
Our method is the first example of a single model that can handle both generation and separation tasks.
arXiv Detail & Related papers (2023-02-04T23:18:36Z)
- Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation [99.19786288094596]
We show how the upper bound can be generalized to the case of random generative models.
We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks.
arXiv Detail & Related papers (2023-01-25T18:21:51Z)
- Decoupled Mixup for Generalized Visual Recognition [71.13734761715472]
We propose a novel "Decoupled-Mixup" method to train CNN models for visual recognition.
Our method decouples each image into discriminative and noise-prone regions, and then heterogeneously combines these regions to train CNN models.
Experimental results show the high generalization performance of our method on test data composed of unseen contexts.
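An illustrative sketch of region-decoupled mixing, assuming soft masks that mark the discriminative regions of each image; the paper's exact decoupling and combination rules are not reproduced here.

```python
import torch

def decoupled_mixup(x1, y1, x2, y2, m1, m2, lam_fg=0.7, lam_bg=0.3):
    """Illustrative only: mix the discriminative (foreground) regions and the
    noise-prone (background) regions with different ratios.
    x*: (B, C, H, W) images, y*: (B, num_classes) one-hot labels,
    m*: assumed soft masks in [0, 1] marking discriminative regions."""
    fg = lam_fg * m1 * x1 + (1 - lam_fg) * m2 * x2
    bg = lam_bg * (1 - m1) * x1 + (1 - lam_bg) * (1 - m2) * x2
    x_mix = fg + bg
    y_mix = lam_fg * y1 + (1 - lam_fg) * y2   # label follows the discriminative mix
    return x_mix, y_mix

# Toy usage with random images, one-hot labels, and masks.
x1, x2 = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
y1, y2 = torch.eye(10)[torch.randint(10, (4,))], torch.eye(10)[torch.randint(10, (4,))]
m1, m2 = torch.rand(4, 1, 32, 32), torch.rand(4, 1, 32, 32)
x_mix, y_mix = decoupled_mixup(x1, y1, x2, y2, m1, m2)
```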
arXiv Detail & Related papers (2022-10-26T15:21:39Z)
- Sound Localization by Self-Supervised Time Delay Estimation [22.125613860688357]
Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone.
We learn these correspondences through self-supervision, drawing on recent techniques from visual tracking.
We also propose a multimodal contrastive learning model that solves a visually-guided localization task.
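For reference, a classic cross-correlation baseline for time-delay estimation between two microphones; the paper instead learns these correspondences with self-supervision, so this only illustrates the quantity being estimated.

```python
import numpy as np

def estimate_delay(sig_a, sig_b, fs):
    """Classic cross-correlation estimate of the delay of sig_a relative to sig_b."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)   # lag (samples) of best match
    return lag / fs                                  # delay in seconds

fs = 16000
rng = np.random.default_rng(0)
clean = rng.standard_normal(4000)                      # stand-in for one mic's signal
delayed = np.concatenate([np.zeros(32), clean[:-32]])  # same signal, 2 ms later
print(estimate_delay(delayed, clean, fs))              # ~0.002
```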
arXiv Detail & Related papers (2022-04-26T17:59:01Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multi-task learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
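A toy sketch of the mix-and-separate objective, with a plain visual embedding standing in for the scene-graph conditioning; AVSGS's actual graph network and separator are more involved.

```python
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Toy conditional separator: predicts a spectrogram mask for one source,
    conditioned on a visual embedding (standing in for the scene-graph code)."""
    def __init__(self, n_freq=256, vis_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq + vis_dim, 512), nn.ReLU(),
                                 nn.Linear(512, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, vis_emb):
        # mix_spec: (B, T, F) magnitude spectrogram of the artificial mixture
        # vis_emb:  (B, vis_dim) embedding of the visual (sub)graph for one source
        cond = vis_emb.unsqueeze(1).expand(-1, mix_spec.size(1), -1)
        mask = self.net(torch.cat([mix_spec, cond], dim=-1))
        return mask * mix_spec                  # estimated source spectrogram

# Mix-and-separate: sum two spectrograms, ask the model to recover each one.
sep = MaskPredictor()
spec_a, spec_b = torch.rand(2, 100, 256), torch.rand(2, 100, 256)
vis_a, vis_b = torch.randn(2, 128), torch.randn(2, 128)
mix = spec_a + spec_b
loss = ((sep(mix, vis_a) - spec_a) ** 2).mean() + ((sep(mix, vis_b) - spec_b) ** 2).mean()
```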
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis [13.263771543118994]
We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from a mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources.
The model is inspired by the fact that when humans listen to music, our minds not only separate the sounds of different instruments, but also perceive high-level representations such as score and timbre.
arXiv Detail & Related papers (2021-08-07T14:28:21Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
Each iteration produces pseudo-labels for the sounding regions, which we then use to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
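An illustrative loss in the spirit of this iterative scheme, assuming boolean pseudo-labels over visual regions produced by a previous iteration; details of the actual framework differ.

```python
import torch
import torch.nn.functional as F

def pseudo_label_contrastive_loss(region_emb, audio_emb, sounding_mask, tau=0.07):
    """Illustrative only: pull regions pseudo-labelled as 'sounding' toward the
    audio of the same video and push the remaining regions away.
    region_emb: (R, D) visual region embeddings, audio_emb: (D,) audio embedding,
    sounding_mask: (R,) boolean pseudo-labels from the previous iteration."""
    sim = F.normalize(region_emb, dim=-1) @ F.normalize(audio_emb, dim=-1) / tau
    pos, neg = sim[sounding_mask], sim[~sounding_mask]
    logits = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(logits, labels)

# Toy usage with random embeddings and random pseudo-labels.
loss = pseudo_label_contrastive_loss(torch.randn(16, 128), torch.randn(128),
                                     torch.rand(16) > 0.5)
```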
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for the task of automatic speech recognition, with extracted melody features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)