MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation
- URL: http://arxiv.org/abs/2211.07302v2
- Date: Thu, 4 May 2023 14:13:42 GMT
- Title: MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation
- Authors: Chang-Bin Jeon, Hyeongi Moon, Keunwoo Choi, Ben Sangbae Chon, and
Kyogu Lee
- Abstract summary: Separating a mixture of multiple singing voices into individual voices is rarely studied in music source separation research.
We introduce MedleyVox, an evaluation dataset for multiple singing voices separation.
We present a strategy for constructing multiple-singing mixtures from various single-singing datasets.
- Score: 10.456845656569444
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Separating multiple singing voices into individual voices is a rarely studied
area in music source separation research. The absence of a benchmark dataset
has hindered its progress. In this paper, we present an evaluation dataset and
provide baseline studies for multiple singing voices separation. First, we
introduce MedleyVox, an evaluation dataset for multiple singing voices
separation. We specify the problem definition in this dataset by categorizing
it into i) unison, ii) duet, iii) main vs. rest, and iv) N-singing separation.
Second, to overcome the absence of existing multi-singing datasets for
training purposes, we present a strategy for constructing multiple-singing
mixtures from various single-singing datasets. Third, we propose the improved
super-resolution network (iSRNet), which greatly enhances initial estimates of
separation networks. Jointly trained with the Conv-TasNet and the multi-singing
mixture construction strategy, the proposed iSRNet achieved comparable
performance to ideal time-frequency masks on duet and unison subsets of
MedleyVox. Audio samples, the dataset, and code are available on our website
(https://github.com/jeonchangbin49/MedleyVox).
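The multiple-singing mixture construction strategy pairs clips from existing single-singing datasets and sums them at a controlled relative level. A minimal sketch of this idea in Python (the function name, the SNR parameter, and the normalization step are illustrative assumptions, not the paper's actual pipeline):

```python
import numpy as np

def make_duet_mixture(voice_a, voice_b, target_snr_db=0.0):
    """Mix two single-singer clips into a synthetic duet.

    voice_a, voice_b: 1-D float arrays (mono audio).
    target_snr_db: level of voice_a relative to voice_b, in dB.
    Returns (mixture, scaled_a, scaled_b) so the sources sum to the mixture.
    """
    n = min(len(voice_a), len(voice_b))
    a = voice_a[:n].astype(np.float64)
    b = voice_b[:n].astype(np.float64)

    # Scale voice_a so its energy sits target_snr_db above voice_b's.
    eps = 1e-12
    gain = np.sqrt((np.sum(b ** 2) + eps) / (np.sum(a ** 2) + eps))
    gain *= 10.0 ** (target_snr_db / 20.0)
    a = a * gain

    mix = a + b
    # Rescale everything together if the mixture would clip, so that
    # the ground-truth sources still sum exactly to the mixture.
    peak = np.max(np.abs(mix)) + eps
    if peak > 1.0:
        a, b, mix = a / peak, b / peak, mix / peak
    return mix, a, b
```

Sampling the pair (same singer for unison, different singers for duet) and the relative level per example is what turns single-singing corpora into multi-singing training data.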
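The ideal time-frequency masks referenced above serve as an oracle upper bound: each source's ideal ratio mask is its magnitude spectrogram divided by the sum of all sources' magnitudes, applied to the mixture STFT. A self-contained sketch using SciPy's STFT (function and parameter names are illustrative; the paper does not prescribe this implementation):

```python
import numpy as np
from scipy.signal import stft, istft

def irm_oracle_separate(mixture, sources, nperseg=1024):
    """Oracle separation with ideal ratio masks (IRM).

    mixture: 1-D array, the sum of the sources.
    sources: list of 1-D ground-truth stems, same length as mixture.
    Returns time-domain estimates obtained by masking the mixture STFT.
    """
    _, _, mix_spec = stft(mixture, nperseg=nperseg)
    mags = [np.abs(stft(s, nperseg=nperseg)[2]) for s in sources]
    denom = np.sum(mags, axis=0) + 1e-12  # avoid division by zero

    estimates = []
    for mag in mags:
        mask = mag / denom                # IRM in [0, 1] per T-F bin
        _, est = istft(mask * mix_spec, nperseg=nperseg)
        estimates.append(est[:len(mixture)])
    return estimates
```

Because the masks are computed from the ground-truth stems, this gives the ceiling that learned separators such as the iSRNet are compared against.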
Related papers
- High-Quality Visually-Guided Sound Separation from Diverse Categories [56.92841782969847]
DAVIS is a Diffusion-based Audio-VIsual Separation framework.
It synthesizes separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information.
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets.
arXiv Detail & Related papers (2023-07-31T19:41:49Z)
- Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequence.
In the second stage, another HMM model is introduced to train from the generator's output, which boosts the performance.
arXiv Detail & Related papers (2022-07-29T09:29:28Z)
- Investigating Multi-Feature Selection and Ensembling for Audio Classification [0.8602553195689513]
Deep Learning algorithms have shown impressive performance in diverse domains.
Audio has attracted many researchers over the last couple of decades due to some interesting patterns.
For better performance of audio classification, feature selection and combination play a key role.
arXiv Detail & Related papers (2022-06-15T13:11:08Z)
- MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification [0.0]
We present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems.
It can be readily used also for experiments with dereverberation, denoising, and speech enhancement.
arXiv Detail & Related papers (2021-11-11T20:55:58Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation [79.63545132515188]
We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
arXiv Detail & Related papers (2020-10-04T22:13:13Z)
- Content based singing voice source separation via strong conditioning using aligned phonemes [7.599399338954308]
In this paper, we present a multimodal multitrack dataset with lyrics aligned in time at the word level with phonetic information.
We show that phoneme conditioning can be successfully applied to improve singing voice source separation.
arXiv Detail & Related papers (2020-08-05T12:25:24Z)
- dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.