Sound Localization by Self-Supervised Time Delay Estimation
- URL: http://arxiv.org/abs/2204.12489v1
- Date: Tue, 26 Apr 2022 17:59:01 GMT
- Title: Sound Localization by Self-Supervised Time Delay Estimation
- Authors: Ziyang Chen, David F. Fouhey and Andrew Owens
- Abstract summary: Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone.
We learn these correspondences through self-supervision, drawing on recent techniques from visual tracking.
We also propose a multimodal contrastive learning model that solves a visually-guided localization task.
- Score: 22.125613860688357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sounds reach one microphone in a stereo pair sooner than the other, resulting
in an interaural time delay that conveys their directions. Estimating a sound's
time delay requires finding correspondences between the signals recorded by
each microphone. We propose to learn these correspondences through
self-supervision, drawing on recent techniques from visual tracking. We adapt
the contrastive random walk of Jabri et al. to learn a cycle-consistent
representation from unlabeled stereo sounds, resulting in a model that performs
on par with supervised methods on "in the wild" internet recordings. We also
propose a multimodal contrastive learning model that solves a visually-guided
localization task: estimating the time delay for a particular person in a
multi-speaker mixture, given a visual representation of their face. Project
site: https://ificl.github.io/stereocrw/
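For context on what "estimating a sound's time delay" involves, the sketch below shows a classical cross-correlation baseline (GCC-PHAT), not the paper's learned, self-supervised model: it estimates the delay between the two channels of a stereo recording and maps it to an azimuth. The sample rate, microphone spacing, sign convention, and function names are illustrative assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature


def gcc_phat(left, right, fs, max_delay_s):
    """Estimate the delay (seconds) of `right` relative to `left` using
    generalized cross-correlation with PHAT weighting (a classical baseline,
    not the paper's learned model)."""
    n = len(left) + len(right)
    cross = np.fft.rfft(right, n=n) * np.conj(np.fft.rfft(left, n=n))
    cross /= np.abs(cross) + 1e-12                 # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = min(int(max_delay_s * fs), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max..+max
    return (np.argmax(np.abs(cc)) - max_shift) / fs


def delay_to_azimuth(delay_s, mic_distance_m):
    """Map an interaural time delay to an azimuth (radians) under a far-field,
    free-field assumption; positive azimuth points toward the left microphone
    (an assumed sign convention)."""
    s = np.clip(delay_s * SPEED_OF_SOUND / mic_distance_m, -1.0, 1.0)
    return np.arcsin(s)


if __name__ == "__main__":
    fs, mic_distance = 16000, 0.20                 # assumed sample rate (Hz) and mic spacing (m)
    rng = np.random.default_rng(0)
    src = rng.standard_normal(fs)                  # one second of a broadband source
    left = src
    right = np.roll(src, 3)                        # right channel lags by 3 samples
    tau = gcc_phat(left, right, fs, max_delay_s=mic_distance / SPEED_OF_SOUND)
    print(f"delay: {tau * 1e3:.3f} ms, "
          f"azimuth: {np.degrees(delay_to_azimuth(tau, mic_distance)):.1f} deg")
```

In the paper's terms, the point of departure is to replace a handcrafted correlation like this with correspondences learned by self-supervision from unlabeled stereo sound.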
Related papers
- Tempo estimation as fully self-supervised binary classification [6.255143207183722]
We propose a fully self-supervised approach that does not rely on any human labeled data.
Our method builds on the fact that generic (music) audio embeddings already encode a variety of properties, including information about tempo.
arXiv Detail & Related papers (2024-01-17T00:15:16Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation [26.867430697990674]
We use images and sounds that undergo subtle but geometrically consistent changes as we rotate our heads to jointly estimate camera rotation and localize sound sources.
A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from sounds.
We train these models to generate predictions that agree with one another; a toy sketch of this agreement objective appears after the related-papers list below.
Our model can successfully estimate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches.
arXiv Detail & Related papers (2023-03-20T17:59:55Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Mix and Localize: Localizing Sound Sources in Mixtures [10.21507741240426]
We present a method for simultaneously localizing multiple sound sources within a visual scene.
Our method solves both tasks jointly, using a formulation inspired by the contrastive random walk of Jabri et al.
We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds.
arXiv Detail & Related papers (2022-11-28T04:30:50Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
Pseudo-labels generated at each iteration are then used to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use a face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
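As a rough illustration of the "train these models to generate predictions that agree" idea in the Sound Localization from Motion entry above (not the authors' implementation), the toy sketch below penalizes disagreement between a camera rotation predicted from two images and the change in sound direction predicted from the corresponding audio. The function names, sign convention, and squared angular error are assumptions.

```python
import numpy as np


def angle_difference(a, b):
    """Smallest signed difference between two angles, in radians."""
    return (a - b + np.pi) % (2 * np.pi) - np.pi


def rotation_audio_agreement_loss(pred_rotation, pred_dir_t1, pred_dir_t2):
    """Squared angular error between the predicted camera rotation and the
    apparent shift in the predicted sound direction: if the head turns by r,
    a static source's direction in the head frame should shift by -r
    (assumed sign convention)."""
    apparent_shift = angle_difference(pred_dir_t2, pred_dir_t1)
    return np.mean(angle_difference(apparent_shift, -pred_rotation) ** 2)


# Toy check: a 10-degree rotation should move the perceived source by -10 degrees.
rotation = np.radians([10.0])
dir_before = np.radians([25.0])
dir_after = np.radians([15.0])    # consistent with the rotation -> near-zero loss
print(rotation_audio_agreement_loss(rotation, dir_before, dir_after))
```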
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.