VOTE400 (Voice Of The Elderly 400 Hours): A Speech Dataset to Study Voice
Interface for Elderly-Care
- URL: http://arxiv.org/abs/2101.11469v1
- Date: Wed, 20 Jan 2021 05:28:05 GMT
- Title: VOTE400 (Voice Of The Elderly 400 Hours): A Speech Dataset to Study
Voice Interface for Elderly-Care
- Authors: Minsu Jang, Sangwon Seo, Dohyung Kim, Jaeyeon Lee, Jaehong Kim,
Jun-Hwan Ahn
- Abstract summary: The dataset includes about 300 hours of continuous dialog speech and 100 hours of read speech, both recorded by elderly people aged 65 years or over.
A preliminary experiment showed that a speech recognition system trained with VOTE400 can outperform conventional systems at recognizing elderly people's voices.
- Score: 11.87467300760354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a large-scale Korean speech dataset, called
VOTE400, that can be used for analyzing and recognizing the voices of elderly
people. The dataset includes about 300 hours of continuous dialog speech and
100 hours of read speech, both recorded by elderly people aged 65 years or
over. A preliminary experiment showed that a speech recognition system trained
with VOTE400 can outperform conventional systems at recognizing elderly
people's voices. This work is a multi-organizational effort led by ETRI and
MINDs Lab Inc. to advance the speech recognition performance of elderly-care
robots.
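The preliminary comparison above implies a standard accuracy metric; the summary does not name one, but word error rate (WER) is the usual choice for ASR. As a minimal sketch, assuming WER scoring and hypothetical transcripts (not VOTE400 data), the per-utterance evaluation can be expressed in plain Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical transcripts comparing a baseline system against one
# fine-tuned on elderly speech; a real evaluation averages over a test set.
reference = "please turn on the living room light"
baseline = "please turn on the leaving room night"
finetuned = "please turn on the living room light"
print(f"baseline WER:   {wer(reference, baseline):.2f}")   # 0.29
print(f"fine-tuned WER: {wer(reference, finetuned):.2f}")  # 0.00
```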
Related papers
- SeniorTalk: A Chinese Conversation Dataset with Rich Annotations for Super-Aged Seniors [23.837811649327094]
SeniorTalk is a carefully annotated Chinese spoken dialogue dataset.
This dataset contains 55.53 hours of speech from 101 natural conversations involving 202 participants.
We perform experiments on speaker verification, speaker diarization, speech recognition, and speech editing tasks.
arXiv Detail & Related papers (2025-03-20T11:31:47Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection [46.855958156126164]
This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset.
arXiv Detail & Related papers (2024-06-11T13:35:50Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics (one representative metric is sketched after this entry).
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
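The EARS entry above scores enhancement systems with instrumental metrics. One widely used such metric is scale-invariant signal-to-distortion ratio (SI-SDR); the NumPy sketch below is a generic illustration, not taken from the paper, which may use a different metric set.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, target: np.ndarray) -> float:
    """Scale-invariant SDR in dB (higher is better)."""
    # Project the estimate onto the target to find the optimal scaling.
    alpha = np.dot(estimate, target) / np.dot(target, target)
    s_target = alpha * target      # scaled target component of the estimate
    e_noise = estimate - s_target  # residual distortion
    return 10.0 * np.log10(np.sum(s_target**2) / np.sum(e_noise**2))

# Toy usage: a clean tone vs. the same tone with additive noise (roughly 17 dB).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print(f"SI-SDR of noisy input: {si_sdr(noisy, clean):.1f} dB")
```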
- Latent Phrase Matching for Dysarthric Speech [23.23672790496787]
Many consumer speech recognition systems are not tuned for people with speech disabilities.
We propose a query-by-example-based personalized phrase recognition system that is trained using small amounts of speech.
Performance degrades as the number of phrases increases, but the system consistently outperforms ASR systems when trained with 50 unique phrases (a generic query-by-example sketch follows this entry).
arXiv Detail & Related papers (2023-06-08T17:28:28Z)
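Query-by-example phrase recognition, as in the entry above, matches an incoming utterance against stored example recordings in a feature space. Here is a minimal sketch of the general idea, assuming MFCC features and dynamic time warping via librosa; the paper's latent-matching model is not reproduced, and the file paths are placeholders.

```python
import numpy as np
import librosa

def mfcc(path: str) -> np.ndarray:
    """Load audio and return an MFCC sequence of shape (n_mfcc, frames)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

def dtw_cost(query: np.ndarray, example: np.ndarray) -> float:
    """DTW alignment cost between two MFCC sequences, normalized by path length."""
    D, wp = librosa.sequence.dtw(X=query, Y=example, metric="cosine")
    return float(D[-1, -1]) / len(wp)

# Placeholder paths: a few enrolled phrase examples and one incoming query.
examples = {
    "turn_on_light": mfcc("examples/turn_on_light.wav"),
    "call_my_son": mfcc("examples/call_my_son.wav"),
}
query = mfcc("incoming/query.wav")
best = min(examples, key=lambda name: dtw_cost(query, examples[name]))
print("recognized phrase:", best)
```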
- Speaker Identification using Speech Recognition [0.0]
This research provides a mechanism for identifying a speaker in an audio file based on biometric features of the human voice, such as pitch, amplitude, and frequency (a minimal feature-matching sketch follows this entry).
We propose an unsupervised learning model that can learn speech representations from a limited dataset.
arXiv Detail & Related papers (2022-05-29T13:03:42Z)
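The speaker identification entry above matches speakers on voice biometrics such as pitch and spectral shape. A crude sketch of that idea, assuming librosa and hypothetical enrollment files: summarize each recording as a fixed-length feature vector, then compare recordings by cosine similarity.

```python
import numpy as np
import librosa

def voice_print(path: str) -> np.ndarray:
    """A crude speaker vector: mean pitch plus MFCC means and deviations."""
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)  # frame-wise pitch in Hz
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([[f0.mean()], m.mean(axis=1), m.std(axis=1)])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical enrollment: one reference recording per known speaker.
enrolled = {name: voice_print(f"enroll/{name}.wav") for name in ("alice", "bob")}
probe = voice_print("unknown.wav")
best = max(enrolled, key=lambda name: cosine(probe, enrolled[name]))
print("best match:", best)
```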
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a word error rate within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition [60.84668086976436]
An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language.
This paper proposes an unsupervised TTS system that leverages recent advances in unsupervised automatic speech recognition (ASR).
Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each.
arXiv Detail & Related papers (2022-03-29T17:57:53Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Mandarin-English Code-switching Speech Recognition with Self-supervised Speech Representation Models [55.82292352607321]
Code-switching (CS) is common in daily conversations where more than one language is used within a sentence.
This paper uses recently successful self-supervised learning (SSL) methods to leverage large amounts of unlabeled speech data without CS.
arXiv Detail & Related papers (2021-10-07T14:43:35Z)
- JukeBox: A Multilingual Singer Recognition Dataset [17.33151600403503]
JukeBox is a speaker recognition dataset with multilingual singing voice audio annotated with singer identity, gender, and language labels.
We use the current state-of-the-art methods to demonstrate the difficulty of performing speaker recognition on singing voice using models trained on spoken voice alone.
arXiv Detail & Related papers (2020-08-08T12:22:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.