A cappella: Audio-visual Singing Voice Separation
- URL: http://arxiv.org/abs/2104.09946v1
- Date: Tue, 20 Apr 2021 13:17:06 GMT
- Title: A cappella: Audio-visual Singing Voice Separation
- Authors: Juan F. Montesinos and Venkatesh S. Kadandale and Gloria Haro
- Abstract summary: We explore the single-channel singing voice separation problem from a multimodal perspective.
We present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube.
We propose Y-Net, an audio-visual convolutional neural network which achieves state-of-the-art singing voice separation results.
- Score: 4.6453787256723365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music source separation can be interpreted as the estimation of the
constituent music sources that a music clip is composed of. In this work, we
explore the single-channel singing voice separation problem from a multimodal
perspective, by jointly learning from audio and visual modalities. To do so, we
present Acappella, a dataset spanning around 46 hours of a cappella solo
singing videos sourced from YouTube. We propose Y-Net, an audio-visual
convolutional neural network which achieves state-of-the-art singing voice
separation results on the Acappella dataset and compare it against its
audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech
separation model. Singing voice separation can be particularly challenging when
the audio mixture also comprises other accompaniment voices and background
sounds along with the target voice of interest. We demonstrate that our model
can outperform the baseline models in the singing voice separation task in such
challenging scenarios. The code, the pre-trained models and the dataset will be
publicly available at https://ipcv.github.io/Acappella/
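The abstract describes Y-Net only at a high level: an audio network (akin to U-Net) whose separation is conditioned on visual information from the singer's face. Below is a minimal sketch of that general audio-visual masking recipe, assuming a PyTorch implementation; the layer sizes, the temporal pooling of the lip features, and the concatenation-based fusion are illustrative assumptions, not the authors' published Y-Net architecture.

```python
# Minimal sketch, not the authors' Y-Net: a spectrogram encoder-decoder whose
# bottleneck is conditioned on lip-motion embeddings and which predicts a soft
# mask over the mixture magnitude spectrogram.
import torch
import torch.nn as nn


class AudioVisualSeparator(nn.Module):
    def __init__(self, video_feat_dim=512):
        super().__init__()
        # Audio encoder: strided 2D convolutions over (B, 1, F, T) magnitude spectrograms.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Visual branch: project precomputed per-frame lip embeddings (B, T_v, D).
        self.video_proj = nn.Linear(video_feat_dim, 64)
        # Decoder: upsample back to the input resolution and emit a mask in [0, 1].
        self.audio_dec = nn.Sequential(
            nn.ConvTranspose2d(128, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, mix_mag, video_feats):
        a = self.audio_enc(mix_mag)                       # (B, 64, F/4, T/4)
        v = self.video_proj(video_feats).mean(dim=1)      # (B, 64), pooled over video frames
        v = v[:, :, None, None].expand(-1, -1, a.shape[2], a.shape[3])
        mask = self.audio_dec(torch.cat([a, v], dim=1))   # fuse by channel concatenation
        return mask * mix_mag                             # estimated target-voice magnitude


if __name__ == "__main__":
    model = AudioVisualSeparator()
    mix = torch.rand(2, 1, 512, 256)   # mixture spectrograms: 512 freq bins, 256 frames
    lips = torch.rand(2, 64, 512)      # 64 lip-embedding frames of dimension 512
    print(model(mix, lips).shape)      # torch.Size([2, 1, 512, 256])
```

Conditioning the mask estimation on lip motion is what lets such a model pick out the target singer when the mixture also contains other voices; the actual Y-Net architecture, its losses, and the audio-only U-Net baseline are specified in the paper and the released code.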
Related papers
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z) - Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z) - Object Segmentation with Audio Context [0.5243460995467893]
This project explores multimodal feature aggregation for the video instance segmentation task.
We integrate audio features into our video segmentation model to enable an audio-visual learning scheme.
arXiv Detail & Related papers (2023-01-04T01:33:42Z) - VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices [4.167459103689587]
We address the problem of lip-voice synchronisation in videos containing a human face and voice.
Our approach is based on determining whether the lip motion and the voice in a video are synchronised.
We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models.
arXiv Detail & Related papers (2022-04-05T10:02:39Z) - VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [4.167459103689587]
This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at low latency in two scenarios: speech and singing voice.
arXiv Detail & Related papers (2022-03-08T14:08:47Z) - A Melody-Unsupervision Model for Singing Voice Synthesis [9.137554315375919]
We propose a melody-unsupervision model that requires only audio-and-lyrics pairs, without temporal alignment, at training time.
We show that the proposed model can be trained with speech audio and text labels yet generate singing voice at inference time.
arXiv Detail & Related papers (2021-10-13T07:42:35Z) - Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task that separates audio sources from artificially mixed sounds using the visual graph.
arXiv Detail & Related papers (2021-09-24T13:40:51Z) - Audiovisual Singing Voice Separation [25.862550744570324]
The video model takes mouth movement as input and fuses it into the feature embeddings of an audio-based separation framework.
We create two audiovisual singing performance datasets for training and evaluation.
The proposed method outperforms audio-based methods in terms of separation quality on most test recordings.
arXiv Detail & Related papers (2021-07-01T06:04:53Z) - VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z) - Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating
Source Separation [96.18178553315472]
We propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio.
We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
arXiv Detail & Related papers (2020-07-20T06:20:26Z) - Learning Speech Representations from Raw Audio by Joint Audiovisual
Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)