CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
- URL: http://arxiv.org/abs/2212.07065v1
- Date: Wed, 14 Dec 2022 07:21:45 GMT
- Title: CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos
- Authors: Hao-Wen Dong, Naoya Takahashi, Yuki Mitsufuji, Julian McAuley, Taylor Berg-Kirkpatrick
- Abstract summary: We propose to approach text-queried universal sound separation by using only unlabeled data.
The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model.
While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting.
- Score: 44.14061539284888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have seen progress beyond domain-specific sound separation for
speech or music towards universal sound separation for arbitrary sounds. Prior
work on universal sound separation has investigated separating a target sound
out of an audio mixture given a text query. Such text-queried sound separation
systems provide a natural and scalable interface for specifying arbitrary
target sounds. However, supervised text-queried sound separation systems
require costly labeled audio-text pairs for training. Moreover, the audio
provided in existing datasets is often recorded in a controlled environment,
causing a considerable generalization gap to noisy audio in the wild. In this
work, we aim to approach text-queried universal sound separation by using only
unlabeled data. We propose to leverage the visual modality as a bridge to learn
the desired audio-textual correspondence. The proposed CLIPSep model first
encodes the input query into a query vector using the contrastive
language-image pretraining (CLIP) model, and the query vector is then used to
condition an audio separation model to separate out the target sound. While the
model is trained on image-audio pairs extracted from unlabeled videos, at test
time we can instead query the model with text inputs in a zero-shot setting,
thanks to the joint language-image embedding learned by the CLIP model.
Further, videos in the wild often contain off-screen sounds and background
noise that may hinder the model from learning the desired audio-textual
correspondence. To address this problem, we further propose an approach called
noise invariant training for training a query-based sound separation model on
noisy data. Experimental results show that the proposed models successfully
learn text-queried universal sound separation using only noisy unlabeled
videos, even achieving competitive performance against a supervised model in
some settings.
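The abstract describes the core mechanism: a frozen CLIP encoder turns the query (a video frame at training time, free-form text at test time) into a query vector, and that vector conditions a sound separation model. Below is a minimal, hypothetical PyTorch sketch of this conditioning pattern. The mask-based separator, its layer sizes, and the use of OpenAI's open-source `clip` package are illustrative assumptions, not details taken from the paper; the sketch also omits the paper's noise invariant training.

```python
# Minimal sketch of CLIP-conditioned separation, assuming OpenAI's `clip` package
# (pip install git+https://github.com/openai/CLIP.git) and a toy mask predictor.
# The separator architecture below is a placeholder, not the paper's model.
import torch
import torch.nn as nn
import clip


class ConditionedSeparator(nn.Module):
    """Hypothetical separator: predicts a spectrogram mask from a CLIP query vector."""

    def __init__(self, n_freq_bins: int = 512, query_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.audio_net = nn.Sequential(
            nn.Linear(n_freq_bins, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.query_proj = nn.Linear(query_dim, hidden)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq_bins), nn.Sigmoid())

    def forward(self, mixture_spec: torch.Tensor, query_vec: torch.Tensor) -> torch.Tensor:
        # mixture_spec: (batch, time, n_freq_bins); query_vec: (batch, query_dim)
        h = self.audio_net(mixture_spec) * self.query_proj(query_vec).unsqueeze(1)
        mask = self.mask_head(h)            # soft mask in [0, 1]
        return mask * mixture_spec          # masked (separated) spectrogram


clip_model, _preprocess = clip.load("ViT-B/32", device="cpu")  # frozen CLIP encoders
separator = ConditionedSeparator()

mixture = torch.rand(1, 100, 512)           # dummy mixture magnitude spectrogram

# Training time: the query vector comes from a video frame (image-audio pairs).
frame = torch.rand(1, 3, 224, 224)          # a (preprocessed) video frame
with torch.no_grad():
    q_train = clip_model.encode_image(frame).float()
separated_train = separator(mixture, q_train)

# Test time: swap in a text query; zero-shot thanks to CLIP's joint embedding space.
with torch.no_grad():
    q_test = clip_model.encode_text(clip.tokenize(["a dog barking"])).float()
separated_test = separator(mixture, q_test)
```

The only piece that changes between training and inference is the query encoder call; because CLIP maps images and text into a shared embedding space, a separator trained on image queries can be driven by text queries at test time.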
Related papers
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data [26.058278155958668]
We propose a three-component pipeline to train a universal audio source separator from a large but weakly-labeled dataset: AudioSet.
Our approach uses a single model for source separation of multiple sound types and relies solely on weakly-labeled data for training.
The proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training.
arXiv Detail & Related papers (2021-12-15T05:13:43Z)
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources from artificially mixed sounds using the visual graph.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach in which the network learns both what individual objects look like and how they sound.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z)
- Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds [33.4237979175049]
We present AudioScope, a novel audio-visual sound separation framework.
It can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos.
We demonstrate the effectiveness of our approach using a dataset of video clips extracted from the open-domain YFCC100m video data.
arXiv Detail & Related papers (2020-11-02T17:36:13Z)