Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition
- URL: http://arxiv.org/abs/2205.03433v1
- Date: Fri, 6 May 2022 18:08:18 GMT
- Title: Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition
- Authors: Yuan Gong, Jin Yu, James Glass
- Abstract summary: We have created a VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs.
Experiments show that the vocal sound recognition performance of a model can be significantly improved by 41.9% by adding the VocalSound dataset to an existing dataset as training material.
- Score: 13.373579620368046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing human non-speech vocalizations is an important task and has broad
applications such as automatic sound transcription and health condition
monitoring. However, existing datasets have a relatively small number of vocal
sound samples or noisy labels. As a consequence, state-of-the-art audio event
classification models may not perform well in detecting human vocal sounds. To
support research on building robust and accurate vocal sound recognition, we
have created a VocalSound dataset consisting of over 21,000 crowdsourced
recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs
from 3,365 unique subjects. Experiments show that the vocal sound recognition
performance of a model can be significantly improved by 41.9% by adding the
VocalSound dataset to an existing dataset as training material. In addition,
unlike previous datasets, the VocalSound dataset contains meta
information such as speaker age, gender, native language, country, and health
condition.
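Since the abstract positions VocalSound as extra training material for an existing corpus, the sketch below shows one plausible way to wire that combination up in PyTorch. It is a hedged illustration only: the file layout, the metadata CSV columns ("filename", "label"), the exact label strings, and the name existing_train_set are assumptions, not the released dataset format or the authors' code.

```python
# Minimal sketch (assumed layout, not the authors' released code) of using
# VocalSound as additional training material alongside an existing dataset.
import csv
from pathlib import Path

import torchaudio
from torch.utils.data import ConcatDataset, DataLoader, Dataset

# Class names taken from the abstract; the release may spell them differently.
VOCALSOUND_LABELS = ["laughter", "sigh", "cough", "throat_clearing", "sneeze", "sniff"]


class VocalSoundDataset(Dataset):
    """Yields (mono waveform, class index) pairs from a hypothetical metadata CSV."""

    def __init__(self, audio_root: str, metadata_csv: str, sample_rate: int = 16000):
        self.audio_root = Path(audio_root)
        self.sample_rate = sample_rate
        with open(metadata_csv, newline="") as f:
            self.rows = list(csv.DictReader(f))  # assumed columns: filename, label

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        row = self.rows[idx]
        wav, sr = torchaudio.load(str(self.audio_root / row["filename"]))
        if sr != self.sample_rate:
            wav = torchaudio.functional.resample(wav, sr, self.sample_rate)
        return wav.mean(dim=0), VOCALSOUND_LABELS.index(row["label"])


# "existing_train_set" stands in for the original training corpus (e.g. an
# AudioSet-style dataset) exposing the same (waveform, label) interface.
# In practice, variable-length clips would also need padding via a collate_fn.
#
# vocalsound = VocalSoundDataset("vocalsound/audio", "vocalsound/meta.csv")
# combined = ConcatDataset([existing_train_set, vocalsound])
# loader = DataLoader(combined, batch_size=32, shuffle=True)
```

ConcatDataset simply appends the VocalSound samples to the existing training pool, which mirrors the abstract's description of adding the dataset as extra training material.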
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Singer Identity Representation Learning using Self-Supervised Techniques [0.0]
We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks.
We explore different self-supervised learning techniques on a large collection of isolated vocal tracks.
We evaluate the quality of the resulting representations on singer similarity and identification tasks.
arXiv Detail & Related papers (2024-01-10T10:41:38Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of the single-speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- A dataset for Audio-Visual Sound Event Detection in Movies [33.59510253345295]
We present a dataset of audio events called Subtitle-Aligned Movie Sounds (SAM-S).
We use publicly-available closed-caption transcripts to automatically mine over 110K audio events from 430 movies.
We identify three dimensions for categorizing audio events (sound, source, and quality) and present the steps involved in producing a final taxonomy of 245 sounds.
arXiv Detail & Related papers (2023-02-14T19:55:39Z)
- EmoGator: A New Open Source Vocal Burst Dataset with Baseline Machine Learning Classification Methodologies [0.0]
The EmoGator dataset consists of 32,130 samples from 357 speakers, totaling 16.9654 hours of audio.
Each sample was classified into one of 30 distinct emotion categories by the speaker.
arXiv Detail & Related papers (2023-01-02T03:02:10Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- JukeBox: A Multilingual Singer Recognition Dataset [17.33151600403503]
JukeBox is a speaker recognition dataset with multilingual singing voice audio annotated with singer identity, gender, and language labels.
We use the current state-of-the-art methods to demonstrate the difficulty of performing speaker recognition on singing voice using models trained on spoken voice alone.
arXiv Detail & Related papers (2020-08-08T12:22:51Z)
- DeepSinger: Singing Voice Synthesis with Data Mined From the Web [194.10598657846145]
DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages.
arXiv Detail & Related papers (2020-07-09T07:00:48Z)
- VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)