Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of
On-Screen Sounds
- URL: http://arxiv.org/abs/2011.01143v2
- Date: Sun, 30 May 2021 03:47:08 GMT
- Title: Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of
On-Screen Sounds
- Authors: Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez,
Daniel P. W. Ellis, John R. Hershey
- Abstract summary: We present AudioScope, a novel audio-visual sound separation framework.
It can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos.
We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data.
- Score: 33.4237979175049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in deep learning has enabled many advances in sound
separation and visual scene understanding. However, extracting sound sources
which are apparent in natural videos remains an open problem. In this work, we
present AudioScope, a novel audio-visual sound separation framework that can be
trained without supervision to isolate on-screen sound sources from real
in-the-wild videos. Prior audio-visual separation work assumed artificial
limitations on the domain of sound classes (e.g., to speech or music),
constrained the number of sources, and required strong sound separation or
visual segmentation labels. AudioScope overcomes these limitations, operating
on an open domain of sounds, with variable numbers of sources, and without
labels or prior visual segmentation. The training procedure for AudioScope uses
mixture invariant training (MixIT) to separate synthetic mixtures of mixtures
(MoMs) into individual sources, where noisy labels for mixtures are provided by
an unsupervised audio-visual coincidence model. Using the noisy labels, along
with attention between video and audio features, AudioScope learns to identify
audio-visual similarity and to suppress off-screen sounds. We demonstrate the
effectiveness of our approach using a dataset of video clips extracted from
open-domain YFCC100m video data. This dataset contains a wide diversity of
sound classes recorded in unconstrained conditions, making the application of
previous methods unsuitable. For evaluation and semi-supervised experiments, we
collected human labels for presence of on-screen and off-screen sounds on a
small subset of clips.
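To make the mixture invariant training (MixIT) step above concrete, here is a minimal NumPy sketch of the mixture-of-mixtures objective: two reference mixtures are summed into a MoM, a separation model estimates M sources from it, and the loss is minimized over all binary assignments of estimated sources to the two mixtures. The negative-SNR loss, the fixed number of sources, and the stand-in model output are illustrative assumptions, not the paper's exact implementation (which additionally uses audio-visual coincidence scores and attention to select on-screen sources).

```python
# Minimal MixIT sketch under the assumptions stated above.
import itertools
import numpy as np

def neg_snr(reference, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower is better)."""
    error = reference - estimate
    return -10.0 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + eps) + eps)

def mixit_loss(mix1, mix2, est_sources):
    """Best-assignment MixIT loss for two reference mixtures.

    est_sources has shape (M, T): the model's separated sources for the
    mixture of mixtures mix1 + mix2. Each estimated source is assigned to
    exactly one reference mixture; the loss is the minimum over all 2**M
    binary assignments.
    """
    num_sources = est_sources.shape[0]
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=num_sources):
        mask = np.asarray(assignment)
        est1 = est_sources[mask == 0].sum(axis=0)  # sources assigned to mix1
        est2 = est_sources[mask == 1].sum(axis=0)  # sources assigned to mix2
        best = min(best, neg_snr(mix1, est1) + neg_snr(mix2, est2))
    return best

# Toy usage with random "mixtures" and a stand-in for a separation model's output.
rng = np.random.default_rng(0)
mix1, mix2 = rng.standard_normal((2, 16000))
mom = mix1 + mix2                         # mixture of mixtures fed to the model
est_sources = np.stack([0.25 * mom] * 4)  # pretend the model produced 4 sources
print(mixit_loss(mix1, mix2, est_sources))
```

In practice the exhaustive search over assignments is feasible because M is small (e.g., 4 or 8 output sources); the same best-assignment idea carries over when the loss is backpropagated through a neural separation model.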
Related papers
- Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation [51.06562260845748]
This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework.
It includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation.
AVSA-Sep successfully separates both sound types, with joint training and cross-modal alignment enhancing effectiveness.
arXiv Detail & Related papers (2023-10-18T05:03:57Z)
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos [44.14061539284888]
We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
arXiv Detail & Related papers (2022-12-14T07:21:45Z)
- Visual Sound Localization in the Wild by Cross-Modal Interference Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework to convert a monaural video into a binaural one by exploiting the relationship across audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach, in which the network learns what individual objects look and sound like.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)