wav2pos: Sound Source Localization using Masked Autoencoders
- URL: http://arxiv.org/abs/2408.15771v1
- Date: Wed, 28 Aug 2024 13:09:20 GMT
- Title: wav2pos: Sound Source Localization using Masked Autoencoders
- Authors: Axel Berg, Jens Gulin, Mark O'Connor, Chuteng Zhou, Karl Åström, Magnus Oskarsson
- Abstract summary: We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem.
We show that such a formulation allows for accurate localization of the sound source by reconstructing coordinates masked in the input.
- Score: 12.306126455995603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem. By training a multi-modal masked autoencoder model that operates on audio recordings and microphone coordinates, we show that such a formulation allows for accurate localization of the sound source by reconstructing coordinates masked in the input. Our approach is flexible in the sense that a single model can be used with an arbitrary number of microphones, even when a subset of audio recordings and microphone coordinates are missing. We test our method on simulated and real-world recordings of music and speech in indoor environments, and demonstrate competitive performance compared to both classical and other learning-based localization methods.
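For illustration only, here is a minimal PyTorch sketch of the set-to-set formulation described in the abstract: per-element audio features and coordinates are embedded as set tokens, unknown coordinates are replaced by a learned mask token, and a transformer reconstructs them. All module names, dimensions, and the layer configuration are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class Wav2PosSketch(nn.Module):
    """Set-to-set sketch: reconstruct masked 3D coordinates from the set."""

    def __init__(self, d_audio=512, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)  # embed per-element audio features
        self.coord_proj = nn.Linear(3, d_model)        # embed 3D coordinates
        self.mask_token = nn.Parameter(torch.zeros(d_model))  # stands in for masked coords
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.coord_head = nn.Linear(d_model, 3)        # reconstructs coordinates

    def forward(self, audio_feats, coords, coord_mask):
        # audio_feats: (B, N, d_audio) features per set element
        # coords:      (B, N, 3) known positions (ignored where masked)
        # coord_mask:  (B, N) True where the coordinate is unknown
        a = self.audio_proj(audio_feats)
        tok = torch.where(coord_mask.unsqueeze(-1),
                          a + self.mask_token,          # hide masked coordinates
                          a + self.coord_proj(coords))
        return self.coord_head(self.encoder(tok))       # (B, N, 3)


# Usage: mask the source element's coordinates and read its reconstruction.
model = Wav2PosSketch()
audio = torch.randn(1, 8, 512)                # e.g. 7 mics + 1 source element
coords = torch.randn(1, 8, 3)
mask = torch.zeros(1, 8, dtype=torch.bool)
mask[:, -1] = True                            # the source position is unknown
estimate = model(audio, coords, mask)[:, -1]  # (1, 3) estimated source position
```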
Related papers
- Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
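A hedged sketch of the fusion idea as described: concatenate an embedding of the 1-best hypothesis, scalar decoder signals, and an audio-encoder embedding, then classify device-directedness. All feature names and dimensions are illustrative assumptions, not the paper's pipeline.

```python
import torch
import torch.nn as nn


class DirectednessClassifier(nn.Module):
    """Fuses text, decoder, and acoustic features into one directedness logit."""

    def __init__(self, d_text=128, d_decoder=16, d_audio=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_text + d_decoder + d_audio, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # logit: addressed to the assistant or not
        )

    def forward(self, text_emb, decoder_feats, audio_emb):
        # text_emb:      (B, d_text) embedding of the ASR 1-best hypothesis
        # decoder_feats: (B, d_decoder) scalar decoder signals, e.g. confidences
        # audio_emb:     (B, d_audio) utterance-level audio-encoder output
        return self.head(torch.cat([text_emb, decoder_feats, audio_emb], dim=-1))
```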
arXiv Detail & Related papers (2023-12-06T17:29:03Z)
- Can CLIP Help Sound Source Localization? [19.370071553914954]
We introduce a framework that translates audio signals into tokens compatible with CLIP's text encoder.
By directly using these embeddings, our method generates audio-grounded masks for the provided audio.
Our findings suggest that utilizing pre-trained image-text models enables our model to generate more complete and compact localization maps for the sounding objects.
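One plausible reading of this pipeline, sketched below: a learned projection maps audio features into CLIP-sized tokens, and similarity between the resulting embedding and image patch features yields a localization mask. The projection shape and the similarity step are guesses for illustration, not the paper's model.

```python
import torch
import torch.nn as nn


class AudioToCLIPTokens(nn.Module):
    """Projects an audio embedding into a sequence of CLIP-sized tokens."""

    def __init__(self, d_audio=512, d_clip=512, n_tokens=8):
        super().__init__()
        self.proj = nn.Linear(d_audio, n_tokens * d_clip)
        self.n_tokens, self.d_clip = n_tokens, d_clip

    def forward(self, audio_emb):
        # audio_emb: (B, d_audio) -> (B, n_tokens, d_clip) pseudo text tokens
        return self.proj(audio_emb).view(-1, self.n_tokens, self.d_clip)


def audio_grounded_mask(audio_text_emb, patch_feats):
    # audio_text_emb: (B, d_clip) text-encoder output for the audio tokens
    # patch_feats:    (B, H*W, d_clip) image patch features
    sim = torch.einsum('bd,bnd->bn', audio_text_emb, patch_feats)
    return sim.softmax(dim=-1)  # (B, H*W) localization mask over patches
```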
arXiv Detail & Related papers (2023-11-07T15:26:57Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Mix and Localize: Localizing Sound Sources in Mixtures [10.21507741240426]
We present a method for simultaneously localizing multiple sound sources within a visual scene.
Our method jointly solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al.
We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds.
arXiv Detail & Related papers (2022-11-28T04:30:50Z)
- Disentangling speech from surroundings with neural embeddings [17.958451380305892]
We present a method to separate speech signals from noisy environments in the embedding space of a neural audio codec.
We introduce a new training procedure that allows our model to produce structured encodings of audio waveforms given by embedding vectors.
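A minimal sketch of the idea, assuming a neural codec with separate encoder and decoder and a learned separator operating on its embeddings; all three callables below are placeholders, not the paper's components.

```python
def denoise_in_embedding_space(codec_encoder, separator, codec_decoder, noisy_wav):
    # codec_encoder: waveform -> structured embedding vectors (placeholder)
    # separator:     trained to keep only the speech component (placeholder)
    # codec_decoder: embeddings -> waveform (placeholder)
    z_noisy = codec_encoder(noisy_wav)
    z_clean = separator(z_noisy)
    return codec_decoder(z_clean)  # re-synthesized speech without surroundings
```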
arXiv Detail & Related papers (2022-03-29T13:58:33Z)
- CAESynth: Real-Time Timbre Interpolation and Pitch Control with Conditional Autoencoders [3.0991538386316666]
CAESynth synthesizes timbre in real time by interpolating reference sounds in their shared latent feature space.
We show that training the conditional autoencoder with a timbre-classification objective, together with adversarial regularization of pitch content, yields a more effective timbre distribution in the latent space.
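An illustrative sketch of latent timbre interpolation under these assumptions: encode two references, blend the latents, and decode with a pitch condition. The encoder, decoder, and conditioning interface are hypothetical placeholders.

```python
import torch


def interpolate_timbre(encoder, decoder, ref_a, ref_b, alpha, pitch_cond):
    # encoder, decoder: trained conditional-autoencoder modules (placeholders)
    # ref_a, ref_b:     two reference sounds (e.g. spectrogram tensors)
    # alpha:            interpolation weight in [0, 1]
    # pitch_cond:       pitch conditioning vector for the decoder
    with torch.no_grad():
        z = (1 - alpha) * encoder(ref_a) + alpha * encoder(ref_b)
        return decoder(z, pitch_cond)  # sound with the blended timbre
```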
arXiv Detail & Related papers (2021-11-09T14:36:31Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
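A minimal sketch of region-wise dynamic stream weighting, assuming per-region audio and visual score maps and a small network that predicts one weight per region from reliability features; all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class RegionStreamFusion(nn.Module):
    """Fuses audio and visual score maps with one learned weight per region."""

    def __init__(self, n_regions=64, d_reliability=32):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Linear(d_reliability, n_regions),
            nn.Sigmoid(),  # keeps each stream weight in [0, 1]
        )

    def forward(self, audio_scores, video_scores, reliability_feats):
        # audio_scores, video_scores: (B, n_regions) per-region evidence maps
        # reliability_feats:          (B, d_reliability) cues for stream quality
        w = self.weight_net(reliability_feats)            # (B, n_regions)
        return w * audio_scores + (1 - w) * video_scores  # fused localization map
```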
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with extracted melody features to drive a waveform-based generator.
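Sketched under assumptions, the described pipeline composes three trained components: ASR-derived content features plus melody features drive a waveform generator. The three callables are placeholders, not the paper's modules.

```python
def convert_singing_voice(acoustic_model, melody_extractor, generator, source_wav):
    # acoustic_model:   ASR-trained network giving content features (placeholder)
    # melody_extractor: yields pitch/melody features from the input (placeholder)
    # generator:        waveform-based generator in the target voice (placeholder)
    content = acoustic_model(source_wav)
    melody = melody_extractor(source_wav)
    return generator(content, melody)
```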
arXiv Detail & Related papers (2020-08-06T18:29:11Z)