YODAS: YouTube-Oriented Dataset for Audio and Speech
- URL: http://arxiv.org/abs/2406.00899v1
- Date: Sun, 2 Jun 2024 23:43:27 GMT
- Title: YODAS: YouTube-Oriented Dataset for Audio and Speech
- Authors: Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe
- Abstract summary: YODAS is a large-scale, multilingual dataset comprising over 500k hours of speech data in more than 100 languages.
The labeled subsets, including manual or automatic subtitles, facilitate supervised model training.
YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license.
- Score: 47.60574092241447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset currently comprising over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, including manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets are apt for self-supervised learning applications. YODAS is distinctive as the first publicly available dataset of its scale, and it is distributed under a Creative Commons license. We introduce the collection methodology utilized for YODAS, which contributes to large-scale speech dataset construction. Subsequently, we provide a comprehensive analysis of the speech and text contained within the dataset. Finally, we describe speech recognition baselines for the top 15 languages.
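As a usage illustration, below is a minimal sketch of streaming a slice of YODAS with the Hugging Face `datasets` library. The repo ID `espnet/yodas`, the per-language config name `en000`, and the field names `utt_id` and `text` are assumptions about the public mirror, not details given in the abstract.

```python
# Minimal sketch: stream a YODAS language subset for inspection.
# Assumptions: the corpus is mirrored on the Hugging Face Hub as
# "espnet/yodas" with per-language configs such as "en000", and each
# example carries an utterance ID plus its subtitle text. Streaming
# avoids downloading the full 500k-hour corpus.
from datasets import load_dataset

yodas = load_dataset("espnet/yodas", "en000", split="train", streaming=True)

for example in yodas.take(3):
    # Field names are assumed; inspect example.keys() on the real mirror.
    print(example["utt_id"], example["text"][:80])
```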
Related papers
- Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond [36.660499609887886]
Speech-MASSIVE is a multilingual Spoken Language Understanding dataset.
It covers 12 languages from different families and inherits the intent-prediction and slot-filling annotations from the MASSIVE corpus.
We demonstrate the suitability of Speech-MASSIVE for other tasks such as speech transcription, language identification, and speech translation.
arXiv Detail & Related papers (2024-08-07T16:55:28Z) - EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset [14.619865864254924]
The Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset is the largest among publicly available audio-visual speech datasets.
The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations.
arXiv Detail & Related papers (2023-01-16T11:40:50Z) - ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models (a toy sketch of this combination follows this list).
We build speech recognition for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z) - Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z) - SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network [45.59907668722702]
We present SpeechStew, a speech recognition model trained on a combination of publicly available speech recognition datasets (see the data-mixing sketch after this list).
Our results include 9.0% WER on AMI-IHM, 4.7% WER on Switchboard, 8.3% WER on CallHome, and 1.3% WER on WSJ.
We also demonstrate that SpeechStew learns powerful transfer learning representations.
arXiv Detail & Related papers (2021-04-05T20:13:36Z) - AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs (see the contrastive-embedding sketch after this list).
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
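For the ASR2K entry above, here is a toy sketch of the three-component idea: an acoustic model yields phone scores, a pronunciation lexicon maps words to phones, and an n-gram-style language model scores word choices, all combined into one decoding score. Every data structure and weight below is an illustrative assumption, not the authors' implementation.

```python
# Toy sketch of a three-component, audio-free decoding score in the
# spirit of ASR2K. All values here are made up for illustration.
import math

# Pretend acoustic-model output: log-probability per recognized phone.
phone_logprobs = {"h": -0.2, "e": -0.3, "l": -0.25, "o": -0.4, "w": -0.9}

# Pronunciation model: word -> phone sequence (normally produced by G2P).
lexicon = {"hello": ["h", "e", "l", "l", "o"], "yellow": ["e", "l", "o", "w"]}

# Unigram "language model" (a stand-in for Crubadan-style n-gram counts).
lm_logprobs = {"hello": math.log(0.7), "yellow": math.log(0.3)}

def score(word, lm_weight=0.5):
    # Combine the average acoustic score over the word's phones
    # with a weighted language-model score.
    acoustic = sum(phone_logprobs[p] for p in lexicon[word]) / len(lexicon[word])
    return acoustic + lm_weight * lm_logprobs[word]

best = max(lexicon, key=score)
print(best, round(score(best), 3))
```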
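For the SpeechStew entry, a minimal data-mixing sketch: interleave several public ASR corpora into a single training stream and train one model on the mixture. To stay self-contained, this example interleaves two LibriSpeech subsets; the paper itself pools AMI, Common Voice, LibriSpeech, Switchboard/Fisher, TED-LIUM, and WSJ, among others. Repo IDs and split names are assumptions about the Hugging Face mirrors.

```python
# Toy sketch of the SpeechStew recipe: mix multiple ASR corpora into
# one training stream. Two LibriSpeech subsets stand in for the full
# collection so the schemas are guaranteed to match.
from datasets import load_dataset, interleave_datasets

clean = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
other = load_dataset("librispeech_asr", "other", split="train.500", streaming=True)

# Uniform sampling over corpora; SpeechStew simply mixes all data with
# no dataset-specific weighting or balancing.
mixed = interleave_datasets([clean, other], probabilities=[0.5, 0.5], seed=0)

for example in mixed.take(2):
    print(example["text"][:60])
```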
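For the AVLnet entry, a toy sketch of learning a shared audio-visual embedding space with a symmetric contrastive objective. The linear "encoders", the dimensions, and the InfoNCE-style loss are illustrative stand-ins for the paper's actual architecture and training objective.

```python
# Toy sketch of AVLnet-style contrastive training: project audio and
# video features into a shared space and pull matched pairs together.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEmbedder(nn.Module):
    def __init__(self, audio_dim=128, video_dim=512, embed_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)  # stand-in audio encoder
        self.video_proj = nn.Linear(video_dim, embed_dim)  # stand-in video encoder

    def forward(self, audio_feats, video_feats):
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    # Matched audio/video pairs sit on the diagonal of the similarity matrix.
    logits = a @ v.T / temperature
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

model = SharedEmbedder()
a, v = model(torch.randn(8, 128), torch.randn(8, 512))
print(contrastive_loss(a, v).item())
```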
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.