JukeBox: A Multilingual Singer Recognition Dataset
- URL: http://arxiv.org/abs/2008.03507v1
- Date: Sat, 8 Aug 2020 12:22:51 GMT
- Title: JukeBox: A Multilingual Singer Recognition Dataset
- Authors: Anurag Chowdhury, Austin Cozzo, Arun Ross
- Abstract summary: JukeBox is a speaker recognition dataset with multilingual singing voice audio annotated with singer identity, gender, and language labels.
We use the current state-of-the-art methods to demonstrate the difficulty of performing speaker recognition on singing voice using models trained on spoken voice alone.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A text-independent speaker recognition system relies on successfully encoding
speech factors such as vocal pitch, intensity, and timbre to achieve good
performance. A majority of such systems are trained and evaluated using spoken
voice or everyday conversational voice data. Spoken voice, however, exhibits a
limited range of possible speaker dynamics, thus constraining the utility of
the derived speaker recognition models. Singing voice, on the other hand,
covers a broader range of vocal and ambient factors and can, therefore, be used
to evaluate the robustness of a speaker recognition system. However, a majority
of existing speaker recognition datasets only focus on the spoken voice. In
comparison, there is a significant shortage of labeled singing voice data
suitable for speaker recognition research. To address this issue, we assemble
JukeBox, a speaker recognition dataset with multilingual singing
voice audio annotated with singer identity, gender, and language labels. We use
the current state-of-the-art methods to demonstrate the difficulty of
performing speaker recognition on singing voice using models trained on spoken
voice alone. We also evaluate the effect of gender and language on speaker
recognition performance, both in spoken and singing voice data. The complete
JukeBox dataset can be accessed at
http://iprobe.cse.msu.edu/datasets/jukebox.html.
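To make the cross-domain evaluation concrete, below is a minimal sketch of speaker verification that enrolls a singer from spoken-voice clips and scores a singing-voice probe with cosine similarity. The MFCC-statistics embedding is a naive stand-in introduced here only for illustration; a real evaluation would plug in a trained speaker encoder.

```python
import numpy as np
import librosa

def embed(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Naive stand-in for a trained speaker encoder: MFCC means and stds.
    A real system would use a learned embedding network instead."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def verify(enroll_wavs, probe_wav, sample_rate=16000, threshold=0.7):
    """Enroll on spoken-voice clips, then score a singing-voice probe.
    The threshold is a placeholder and must be calibrated per system."""
    enroll = np.mean([embed(w, sample_rate) for w in enroll_wavs], axis=0)
    score = cosine_score(enroll, embed(probe_wav, sample_rate))
    return score, score >= threshold
```

The spoken-versus-singing mismatch the paper measures would surface here as genuine singing-voice probes scoring closer to impostors than genuine spoken-voice probes do.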
Related papers
- Character-aware audio-visual subtitling in context
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- Singer Identity Representation Learning using Self-Supervised Techniques
We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks.
We explore different self-supervised learning techniques on a large collection of isolated vocal tracks.
We evaluate the quality of the resulting representations on singer similarity and identification tasks.
arXiv Detail & Related papers (2024-01-10T10:41:38Z)
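The entry above leaves the specific self-supervised objectives open; one common contrastive choice treats two segments cut from the same isolated vocal track as a positive pair and all other tracks in the batch as negatives. The NT-Xent-style loss below is a generic sketch under that assumption, not necessarily the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1[i], z2[i]: embeddings of two segments of the same vocal track
    (a positive pair); all other rows in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                 # (2N, D)
    sim = z @ z.t() / temperature                  # pairwise similarities
    n = z1.size(0)
    eye = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float("-inf"))      # exclude self-similarity
    # The positive for row i is row (i + n) mod 2n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```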
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker singing voice synthesis (SVS) system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
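A differentiable duration regulator, as mentioned in the entry above, can be realized for example by Gaussian upsampling: each phoneme's hidden state is spread over output frames with soft weights, so gradients flow back through the predicted durations. This sketch shows that general technique, not the paper's specific module.

```python
import torch

def gaussian_upsample(h: torch.Tensor, durations: torch.Tensor, sigma: float = 1.0):
    """Differentiably expand phoneme-level states to frame-level states.
    h:         (T_phone, D) phoneme hidden states
    durations: (T_phone,)   predicted durations in frames (may carry gradients)
    returns    (T_frame, D) with T_frame = round(sum(durations))"""
    centers = torch.cumsum(durations, dim=0) - 0.5 * durations
    t_frame = int(durations.sum().round().item())
    frames = torch.arange(t_frame, dtype=h.dtype, device=h.device) + 0.5
    # Soft alignment weight between every output frame and every phoneme.
    logits = -((frames.unsqueeze(1) - centers.unsqueeze(0)) ** 2) / (2 * sigma ** 2)
    weights = torch.softmax(logits, dim=1)         # (T_frame, T_phone)
    return weights @ h
```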
- AudioPaLM: A Large Language Model That Can Speak and Listen
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
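Fusing a text-based and a speech-based language model, in the style described above, typically amounts to extending the decoder's vocabulary with discrete audio tokens so one autoregressive model can read and write both modalities. The toy decoder below illustrates only that shared-vocabulary idea; the sizes, tokenizers, and architecture are invented for illustration and are not AudioPaLM's.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32000   # illustrative sizes, not AudioPaLM's actual ones
AUDIO_VOCAB = 1024   # discrete audio tokens, e.g., from a learned codec

class MixedModalLM(nn.Module):
    """Single causal decoder over a shared text + audio token vocabulary."""
    def __init__(self, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        vocab = TEXT_VOCAB + AUDIO_VOCAB
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        t = token_ids.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                       device=token_ids.device), diagonal=1)
        x = self.blocks(self.embed(token_ids), mask=causal)
        return self.head(x)   # next-token logits over both modalities

def audio_id(codec_token: int) -> int:
    """Offset a discrete audio token into the shared vocabulary."""
    return TEXT_VOCAB + codec_token
```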
- VocalSound: A Dataset for Improving Human Vocal Sounds Recognition
We created VocalSound, a dataset of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs.
Experiments show that a model's vocal sound recognition performance can be improved significantly, by 41.9%, when the VocalSound dataset is added to an existing dataset as training material.
arXiv Detail & Related papers (2022-05-06T18:08:18Z)
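The gain above comes from pooling VocalSound with an existing corpus at training time, which in PyTorch is a one-line `ConcatDataset`. The random tensors below are stand-ins for real feature/label pairs; only the pooling pattern is the point.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Stand-ins for the real corpora: in practice each yields (features, label)
# pairs over a shared label space (laughter, sigh, cough, throat clearing,
# sneeze, sniff).
existing_train = TensorDataset(torch.randn(100, 128), torch.randint(0, 6, (100,)))
vocalsound_train = TensorDataset(torch.randn(100, 128), torch.randint(0, 6, (100,)))

# Pool the corpora; the rest of the training loop is unchanged.
combined = ConcatDataset([existing_train, vocalsound_train])
loader = DataLoader(combined, batch_size=64, shuffle=True)

for features, labels in loader:
    pass  # feed mixed-corpus batches to the classifier as usual
```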
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
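The scheme above pairs an audio-only pretext task (predicting audio attributes) with a visual one (generating talking faces), both driven by one raw-audio encoder. The schematic module below combines the two losses; every submodule and dimension is a placeholder rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAVSelfSupervision(nn.Module):
    """One raw-waveform encoder trained with two pretext heads:
    (1) regress audio attributes, (2) generate a talking-face frame."""
    def __init__(self, emb_dim=256, n_attributes=8):
        super().__init__()
        self.encoder = nn.Sequential(           # toy raw-waveform encoder
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, emb_dim))
        self.attr_head = nn.Linear(emb_dim, n_attributes)
        self.face_head = nn.Linear(emb_dim, 64 * 64)  # flat grayscale frame

    def forward(self, wave, attr_targets, face_targets, w_visual=1.0):
        z = self.encoder(wave.unsqueeze(1))     # (B, emb_dim)
        loss_audio = F.mse_loss(self.attr_head(z), attr_targets)
        loss_visual = F.mse_loss(self.face_head(z), face_targets.flatten(1))
        return loss_audio + w_visual * loss_visual   # joint objective
```

After pre-training, only `encoder` would be kept and fine-tuned for downstream speech tasks.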
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- VoiceCoach: Interactive Evidence-based Training for Voice Modulation Skills in Public Speaking
The modulation of voice properties, such as pitch, volume, and speed, is crucial for delivering a successful public speech.
We present VoiceCoach, an interactive evidence-based approach to facilitate the effective training of voice modulation skills.
arXiv Detail & Related papers (2020-01-22T04:52:06Z)
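Training modulation skills presupposes measuring pitch, volume, and speed. As an illustration only, the snippet below extracts rough descriptors of all three with librosa (pYIN pitch, RMS energy, and a voiced-frame ratio as a crude speed proxy); VoiceCoach's actual feature pipeline may differ.

```python
import numpy as np
import librosa

def modulation_features(path: str) -> dict:
    """Rough pitch / volume / speed descriptors for one speech clip."""
    y, sr = librosa.load(path, sr=16000)
    # Pitch: fundamental-frequency track via probabilistic YIN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    # Volume: root-mean-square energy per frame.
    rms = librosa.feature.rms(y=y)[0]
    return {
        "pitch_mean_hz": float(np.nanmean(f0)),
        "pitch_range_hz": float(np.nanmax(f0) - np.nanmin(f0)),
        "volume_mean": float(rms.mean()),
        # Crude speed proxy; a real system would use syllable or word rate.
        "voiced_ratio": float(np.mean(voiced_flag)),
    }
```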
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.