MLS: A Large-Scale Multilingual Dataset for Speech Research
- URL: http://arxiv.org/abs/2012.03411v2
- Date: Sat, 19 Dec 2020 09:18:21 GMT
- Title: MLS: A Large-Scale Multilingual Dataset for Speech Research
- Authors: Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan
Collobert
- Abstract summary: The dataset is derived from read audiobooks from LibriVox.
It consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages.
- Score: 37.803100082550294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large
multilingual corpus suitable for speech research. The dataset is derived from
read audiobooks from LibriVox and consists of 8 languages, including about
44.5K hours of English and a total of about 6K hours for other languages.
Additionally, we provide Language Models (LM) and baseline Automatic Speech
Recognition (ASR) models for all the languages in our dataset. We believe
such a large transcribed dataset will open new avenues in ASR and
Text-To-Speech (TTS) research. The dataset will be made freely available for
anyone at http://www.openslr.org.
Related papers
- Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond [36.660499609887886]
Speech-MASSIVE is a multilingual Spoken Language Understanding dataset.
It covers 12 languages from different families and inherits annotations for the intent-prediction and slot-filling tasks.
We demonstrate the suitability of Speech-MASSIVE for other tasks such as speech transcription, language identification, and speech translation.
arXiv Detail & Related papers (2024-08-07T16:55:28Z) - Towards Robust Speech Representation Learning for Thousands of Languages [77.2890285555615]
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data.
We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages.
arXiv Detail & Related papers (2024-06-30T21:40:26Z) - ViSpeR: Multilingual Audio-Visual Speech Recognition [9.40993779729177]
This work presents an extensive and detailed study on Audio-Visual Speech Recognition for five widely spoken languages.
We collect large-scale datasets for each language except English and train supervised learning models.
Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language.
arXiv Detail & Related papers (2024-05-27T14:48:51Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages [20.25236081418051]
Zambezi Voice is an open-source multilingual speech resource for Zambian languages.
To our knowledge, this is the first multilingual speech dataset created for Zambian languages.
arXiv Detail & Related papers (2023-06-07T13:36:37Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z) - The Multilingual TEDx Corpus for Speech Recognition and Translation [30.993199499048824]
We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages.
The corpus is a collection of audio recordings from TEDx talks in 8 source languages.
We segment transcripts into sentences and align them to the source-language audio and target-language translations.
arXiv Detail & Related papers (2021-02-02T21:16:25Z) - CoVoST 2 and Massively Multilingual Speech-to-Text Translation [24.904548615918355]
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages.
This represents the largest open dataset available to date in terms of total volume and language coverage.
arXiv Detail & Related papers (2020-07-20T17:53:35Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.