Content based singing voice source separation via strong conditioning using aligned phonemes
- URL: http://arxiv.org/abs/2008.02070v1
- Date: Wed, 5 Aug 2020 12:25:24 GMT
- Title: Content based singing voice source separation via strong conditioning using aligned phonemes
- Authors: Gabriel Meseguer-Brocal, Geoffroy Peeters
- Abstract summary: In this paper, we present a multimodal multitrack dataset with lyrics aligned in time at the word level with phonetic information.
We show that phoneme conditioning can be successfully applied to improve singing voice source separation.
- Score: 7.599399338954308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Informed source separation has recently gained renewed interest with the
introduction of neural networks and the availability of large multitrack
datasets containing both the mixture and the separated sources. These
approaches use prior information about the target source to improve separation.
Historically, Music Information Retrieval researchers have focused primarily on
score-informed source separation, but more recent approaches explore
lyrics-informed source separation. However, because of the lack of multitrack
datasets with time-aligned lyrics, models use weak conditioning with
non-aligned lyrics. In this paper, we present a multimodal multitrack dataset
with lyrics aligned in time at the word level with phonetic information as well
as explore strong conditioning using the aligned phonemes. Our model follows a
U-Net architecture and takes as input both the magnitude spectrogram of a
musical mixture and a matrix with aligned phonetic information. The phoneme
matrix is embedded to obtain the parameters that control Feature-wise Linear
Modulation (FiLM) layers. These layers condition the U-Net feature maps to
adapt the separation process to the presence of different phonemes via affine
transformations. We show that phoneme conditioning can be successfully applied
to improve singing voice source separation.
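A rough sketch of the conditioning idea described in the abstract is given below. It is a minimal illustration, not the authors' implementation: it assumes PyTorch, an invented `PhonemeFiLM` module, and arbitrary shapes (a 40-symbol phoneme vocabulary, 64 U-Net feature channels), and it pools the phoneme information into a single activation vector per excerpt rather than keeping the full time alignment used in the paper.
```python
# Minimal sketch (assumption, not the paper's code) of FiLM-style phoneme
# conditioning applied to convolutional feature maps of a U-Net.
import torch
import torch.nn as nn

class PhonemeFiLM(nn.Module):
    """Embeds phoneme activations and predicts FiLM (gamma, beta) parameters."""
    def __init__(self, n_phonemes: int = 40, n_channels: int = 64, emb_dim: int = 16):
        super().__init__()
        self.embed = nn.Linear(n_phonemes, emb_dim)     # phoneme activations -> embedding
        self.to_gamma = nn.Linear(emb_dim, n_channels)  # scale per feature channel
        self.to_beta = nn.Linear(emb_dim, n_channels)   # shift per feature channel

    def forward(self, feats: torch.Tensor, phonemes: torch.Tensor) -> torch.Tensor:
        # feats:    (batch, n_channels, freq, time) U-Net feature maps
        # phonemes: (batch, n_phonemes) phoneme activations for the excerpt
        z = torch.relu(self.embed(phonemes))
        gamma = self.to_gamma(z)[:, :, None, None]      # (batch, n_channels, 1, 1)
        beta = self.to_beta(z)[:, :, None, None]
        return gamma * feats + beta                     # channel-wise affine modulation

# Usage with random data:
film = PhonemeFiLM()
feats = torch.randn(2, 64, 128, 100)    # features from the mixture spectrogram
phonemes = torch.rand(2, 40)            # aligned phoneme activations (toy values)
conditioned = film(feats, phonemes)
print(conditioned.shape)                # torch.Size([2, 64, 128, 100])
```
The key point the sketch conveys is that the phoneme input never mixes directly with the audio; it only produces per-channel scale and shift parameters that modulate the separation network's intermediate features.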
Related papers
- MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation [10.456845656569444]
Separating multiple singing voices into individual voices has rarely been studied in music source separation research.
We introduce MedleyVox, an evaluation dataset for multiple singing voices separation.
We present a strategy for constructing multiple-singing mixtures from various single-singing datasets.
arXiv Detail & Related papers (2022-11-14T12:27:35Z) - VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix.
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise to uni-modal data, it guides models to learn token-level interactions across modalities for better denoising.
arXiv Detail & Related papers (2022-06-17T17:56:47Z) - Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
With our proposed system, we achieve a significant improvement on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z) - Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z) - MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics
Transcription [8.669338893753885]
This paper makes several contributions to automatic lyrics transcription (ALT) research.
Our main contribution is a novel variant of the Multistreaming Time-Delay Neural Network (MTDNN) architecture, called MSTRE-Net.
We present a new test set with a considerably larger size and a higher musical variability compared to the existing datasets used in ALT.
arXiv Detail & Related papers (2021-08-05T13:59:11Z) - Source Separation and Depthwise Separable Convolutions for Computer
Audition [0.0]
We train a depthwise separable convolutional neural network on a challenging electronic dance music data set.
It is shown that source separation improves classification performance in a limited-data setting compared to the standard single spectrogram approach.
arXiv Detail & Related papers (2020-12-06T19:30:26Z) - Decoupling Pronunciation and Language for End-to-end Code-switching
Automatic Speech Recognition [66.47000813920617]
We propose a decoupled transformer model to use monolingual paired data and unpaired text data.
The model is decoupled into two parts: audio-to-phoneme (A2P) network and phoneme-to-text (P2T) network.
By using monolingual data and unpaired text data, the decoupled transformer model reduces the E2E model's heavy dependency on code-switching paired training data.
arXiv Detail & Related papers (2020-10-28T07:46:15Z) - Multi-microphone Complex Spectral Mapping for Utterance-wise and
Continuous Speech Separation [79.63545132515188]
We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
arXiv Detail & Related papers (2020-10-04T22:13:13Z) - Unsupervised Cross-Modal Audio Representation Learning from Unstructured
Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z) - Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)