Audiovisual Singing Voice Separation
- URL: http://arxiv.org/abs/2107.00231v1
- Date: Thu, 1 Jul 2021 06:04:53 GMT
- Title: Audiovisual Singing Voice Separation
- Authors: Bochen Li, Yuxuan Wang, and Zhiyao Duan
- Abstract summary: The video frontend model takes mouth movement as input and fuses it into the feature embeddings of an audio-based separation framework.
We create two audiovisual singing performance datasets for training and evaluation.
The proposed method outperforms audio-based methods in terms of separation quality on most test recordings.
- Score: 25.862550744570324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Separating a song into vocal and accompaniment components is an active
research topic, and recent years have seen steady performance gains from
supervised training with deep learning techniques. We propose to exploit the
visual information corresponding to the singers' vocal activities to further
improve the quality of the separated vocal signals. The video frontend model
takes mouth movement as input and fuses it into the feature embeddings of
an audio-based separation framework. To help the network learn the
audiovisual correlation of singing activities, we add extra vocal signals
that are irrelevant to the mouth movement to the audio mixture during training.
We create two audiovisual singing performance datasets, for training and
evaluation respectively: one curated from audition recordings on the Internet,
and the other recorded in house. The proposed method outperforms audio-based
methods in terms of separation quality on most test recordings. This advantage
is especially pronounced when the accompaniment contains backing vocals,
which pose a great challenge for audio-only methods.
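To make the fusion concrete, here is a minimal PyTorch sketch of conditioning a spectrogram-mask separator on mouth-movement video, in the spirit of the abstract. The MouthEncoder, the 1x1-convolution fusion at the encoder output, all tensor shapes, and the augment_with_irrelevant_vocals helper are illustrative assumptions and not the authors' released implementation; the audio encoder and decoder stand in for whichever audio-based separation backbone is used.

```python
# Sketch only: module names, shapes, and the fusion scheme are assumptions,
# not the implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MouthEncoder(nn.Module):
    """Encodes a sequence of mouth-region crops into per-frame embeddings."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.conv = nn.Sequential(               # input: (B, 1, T_v, H, W) grayscale crops
            nn.Conv3d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep the time axis, pool space to 4x4
        )
        self.proj = nn.Linear(32 * 4 * 4, embed_dim)

    def forward(self, mouth):                    # mouth: (B, 1, T_v, H, W)
        h = self.conv(mouth)                     # (B, 32, T_v, 4, 4)
        h = h.permute(0, 2, 1, 3, 4).flatten(2)  # (B, T_v, 32*4*4)
        return self.proj(h)                      # (B, T_v, embed_dim)


class AudiovisualSeparator(nn.Module):
    """Fuses video embeddings into the bottleneck of an audio separation network."""

    def __init__(self, audio_encoder, audio_decoder, audio_dim=256, embed_dim=128):
        super().__init__()
        self.video_frontend = MouthEncoder(embed_dim)
        self.audio_encoder = audio_encoder       # mixture spectrogram -> (B, audio_dim, F', T')
        self.audio_decoder = audio_decoder       # fused features -> mask, same size as input spec
        self.fuse = nn.Conv2d(audio_dim + embed_dim, audio_dim, kernel_size=1)

    def forward(self, mixture_spec, mouth):      # mixture_spec: (B, 1, F, T)
        a = self.audio_encoder(mixture_spec)     # (B, audio_dim, F', T')
        v = self.video_frontend(mouth)           # (B, T_v, embed_dim)
        # Resample video embeddings to the audio time axis and tile over frequency.
        v = F.interpolate(v.transpose(1, 2), size=a.shape[-1], mode="linear")
        v = v.unsqueeze(2).expand(-1, -1, a.shape[2], -1)   # (B, embed_dim, F', T')
        fused = self.fuse(torch.cat([a, v], dim=1))         # (B, audio_dim, F', T')
        mask = torch.sigmoid(self.audio_decoder(fused))     # (B, 1, F, T) vocal mask
        return mask * mixture_spec               # estimated vocal spectrogram


def augment_with_irrelevant_vocals(mixture_spec, extra_vocal_spec, gain=0.5):
    """Training-time augmentation: add a vocal signal that does not match the
    on-screen mouth movement, so the model must rely on the visual cue."""
    return mixture_spec + gain * extra_vocal_spec
```

The augmentation helper mirrors the training trick described in the abstract: mixing in vocals unrelated to the visible mouth movement discourages the network from separating on audio statistics alone and pushes it to exploit the visual stream.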
Related papers
- Singer Identity Representation Learning using Self-Supervised Techniques [0.0]
We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks.
We explore different self-supervised learning techniques on a large collection of isolated vocal tracks.
We evaluate the quality of the resulting representations on singer similarity and identification tasks.
arXiv Detail & Related papers (2024-01-10T10:41:38Z)
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [4.167459103689587]
This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice.
arXiv Detail & Related papers (2022-03-08T14:08:47Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- A cappella: Audio-visual Singing Voice Separation [4.6453787256723365]
We explore the single-channel singing voice separation problem from a multimodal perspective.
We present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube.
We propose Y-Net, an audio-visual convolutional neural network which achieves state-of-the-art singing voice separation results.
arXiv Detail & Related papers (2021-04-20T13:17:06Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)