Self-Supervised Learning from Automatically Separated Sound Scenes
- URL: http://arxiv.org/abs/2105.02132v1
- Date: Wed, 5 May 2021 15:37:17 GMT
- Title: Self-Supervised Learning from Automatically Separated Sound Scenes
- Authors: Eduardo Fonseca, Aren Jansen, Daniel P. W. Ellis, Scott Wisdom, Marco
Tagliasacchi, John R. Hershey, Manoj Plakal, Shawn Hershey, R. Channing
Moore, Xavier Serra
- Abstract summary: This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into semantically-linked views.
We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches.
- Score: 38.71803524843168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world sound scenes consist of time-varying collections of sound sources,
each generating characteristic sound events that are mixed together in audio
recordings. The association of these constituent sound events with their
mixture and each other is semantically constrained: the sound scene contains
the union of source classes and not all classes naturally co-occur. With this
motivation, this paper explores the use of unsupervised automatic sound
separation to decompose unlabeled sound scenes into multiple
semantically-linked views for use in self-supervised contrastive learning. We
find that learning to associate input mixtures with their automatically
separated outputs yields stronger representations than past approaches that use
the mixtures alone. Further, we discover that optimal source separation is not
required for successful contrastive learning by demonstrating that a range of
separation system convergence states all lead to useful and often complementary
example transformations. Our best system incorporates these unsupervised
separation models into a single augmentation front-end and jointly optimizes
similarity maximization and coincidence prediction objectives across the views.
The result is an unsupervised audio representation that rivals state-of-the-art
alternatives on the established shallow AudioSet classification benchmark.
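For illustration, here is a minimal PyTorch sketch of the two training objectives described in the abstract, assuming the separated view for each mixture has already been produced by an unsupervised separation front-end and converted to log-mel features. The `AudioEncoder`, the coincidence head, and the equal loss weighting are placeholder assumptions for readability, not the authors' architecture.

```python
# Hedged sketch: contrastive learning over (mixture, separated-source) views.
# The separation front-end is assumed to run offline; encoder, heads, and loss
# weights below are illustrative placeholders, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Toy embedding network over log-mel patches (placeholder architecture)."""
    def __init__(self, n_mels=64, n_frames=96, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(n_mels * n_frames, 512), nn.ReLU(),
            nn.Linear(512, dim),
        )

    def forward(self, x):                        # x: (batch, n_mels, n_frames)
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings


def nt_xent(z_a, z_b, temperature=0.1):
    """Similarity maximization between mixture embeddings and separated-view embeddings."""
    logits = z_a @ z_b.t() / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0))           # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


encoder = AudioEncoder()
coincidence_head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))


def training_step(mixture_feats, separated_feats):
    """mixture_feats, separated_feats: (batch, n_mels, n_frames) log-mel views,
    where separated_feats[i] is one source separated from mixture i (assumed given)."""
    z_mix, z_sep = encoder(mixture_feats), encoder(separated_feats)
    sim_loss = nt_xent(z_mix, z_sep)

    # Coincidence prediction: does this (mixture, source) pair come from the same clip?
    pos = torch.cat([z_mix, z_sep], dim=-1)
    neg = torch.cat([z_mix, z_sep.roll(1, dims=0)], dim=-1)   # mismatched pairs
    logits = coincidence_head(torch.cat([pos, neg])).squeeze(-1)
    labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))])
    coin_loss = F.binary_cross_entropy_with_logits(logits, labels)

    return sim_loss + coin_loss                   # equal weighting is an assumption
```

In the paper, the separated views come from unsupervised separation models at several convergence states bundled into a single augmentation front-end; the sketch simply treats one separated channel per mixture as given.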
Related papers
- Universal Sound Separation with Self-Supervised Audio Masked Autoencoder [35.560261097213846]
We propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system.
The proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.
arXiv Detail & Related papers (2024-07-16T14:11:44Z) - Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware
Sound Separation [51.06562260845748]
This paper introduces a novel "Audio-Visual Scene-Aware Separation" framework.
It includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation.
AVSA-Sep successfully separates both sound types, with joint training and cross-modal alignment enhancing effectiveness.
arXiv Detail & Related papers (2023-10-18T05:03:57Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled
Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting (a rough sketch of this zero-shot querying idea appears after this list).
arXiv Detail & Related papers (2022-12-14T07:21:45Z) - Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z) - Cyclic Co-Learning of Sounding Object Visual Grounding and Sound
Separation [52.550684208734324]
We propose a cyclic co-learning paradigm that can jointly learn sounding object visual grounding and audio-visual sound separation.
We show that the proposed framework outperforms recent approaches on both tasks.
arXiv Detail & Related papers (2021-04-05T17:30:41Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z) - Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach, where the network learns both what individual objects look like and what they sound like.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z) - Foreground-Background Ambient Sound Scene Separation [0.0]
We propose a deep learning-based separation framework with a suitable feature normalization scheme and an optional auxiliary network capturing the background statistics.
We conduct extensive experiments with mixtures of seen or unseen sound classes at various signal-to-noise ratios.
arXiv Detail & Related papers (2020-05-11T06:59:46Z)
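As a rough illustration of the zero-shot querying idea in the CLIPSep entry above: the query vector that conditions the separator comes from CLIP's image encoder during training on unlabeled videos, and is swapped for CLIP's text encoder at test time, since the two encoders share an embedding space. Only the Hugging Face `transformers` CLIP calls below are real APIs; `QueriedSeparator` and its masking scheme are invented placeholders, not the CLIPSep model.

```python
# Hedged sketch of CLIPSep-style zero-shot querying: train with image-derived
# query vectors, query with text at test time. "QueriedSeparator" is a made-up
# placeholder separator, not the actual CLIPSep architecture.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


class QueriedSeparator(nn.Module):
    """Placeholder separator conditioned on a CLIP query vector."""
    def __init__(self, query_dim=512, n_freq=513):
        super().__init__()
        self.film = nn.Linear(query_dim, n_freq)      # query -> per-frequency mask bias

    def forward(self, mixture_spec, query_vec):       # mixture_spec: (batch, n_freq, frames)
        mask = torch.sigmoid(self.film(query_vec)).unsqueeze(-1)
        return mixture_spec * mask                    # crude masked "separation"


separator = QueriedSeparator()

# Training-time query: CLIP image embedding from an unlabeled video frame.
def image_query(frames):                              # frames: list of PIL images
    inputs = processor(images=frames, return_tensors="pt")
    return clip.get_image_features(**inputs)          # (batch, 512)

# Test-time query: CLIP text embedding, used zero-shot.
def text_query(prompts):                              # e.g. ["the sound of a dog barking"]
    inputs = processor(text=prompts, return_tensors="pt", padding=True)
    return clip.get_text_features(**inputs)           # (batch, 512)

# Zero-shot use: same separator, text-derived query instead of an image-derived one.
mix = torch.randn(1, 513, 100)                        # fake magnitude spectrogram
est = separator(mix, text_query(["the sound of a dog barking"]))
```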
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences of its use.