A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech
Recognition Baseline
- URL: http://arxiv.org/abs/2009.10334v2
- Date: Wed, 13 Jan 2021 09:08:07 GMT
- Authors: Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov, Alen
Adiyev, Mukhamet Nurpeiissov and Huseyin Atakan Varol
- Abstract summary: The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups.
The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an open-source speech corpus for the Kazakh language. The Kazakh
speech corpus (KSC) contains around 332 hours of transcribed audio comprising
over 153,000 utterances spoken by participants from different regions and age
groups, as well as both genders. It was carefully inspected by native Kazakh
speakers to ensure high quality. The KSC is the largest publicly available
database developed to advance various Kazakh speech and language processing
applications. In this paper, we first describe the data collection and
preprocessing procedures followed by a description of the database
specifications. We also share our experience and challenges faced during the
database construction, which might benefit other researchers planning to build
a speech corpus for a low-resource language. To demonstrate the reliability of
the database, we performed preliminary speech recognition experiments. The
experimental results imply that the quality of audio and transcripts is
promising (2.8% character error rate and 8.7% word error rate on the test set).
To enable experiment reproducibility and ease the corpus usage, we also
released an ESPnet recipe for our speech recognition models.
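The reported figures (2.8% character error rate, 8.7% word error rate) are the standard edit-distance metrics: the Levenshtein distance between reference and hypothesis, normalized by the reference length. A minimal sketch of how they are computed; the sample strings are illustrative, not drawn from the KSC test set:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance for ref[:i], hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def wer(ref, hyp):
    """Word error rate: edit distance over word sequences."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: edit distance over character sequences."""
    return edit_distance(ref, hyp) / len(ref)
```

Note that both metrics can exceed 100% when the hypothesis contains many insertions.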
Related papers
- Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text
It is worth investigating how to improve Whisper's performance on under-represented languages.
We utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh.
We achieved more than 10% absolute WER reduction in multiple experiments.
arXiv Detail & Related papers (2024-08-10T13:39:13Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- Automatic Dialect Density Estimation for African American English
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
- KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics
We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus.
In the new KazakhTTS2 corpus, the overall size is increased from 93 hours to 271 hours.
The number of speakers has risen from two to five (three females and two males), and the topic coverage is diversified with the help of new sources, including a book and Wikipedia articles.
arXiv Detail & Related papers (2022-01-15T06:54:30Z)
- USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments
We present a freely available speech corpus for the Uzbek language.
We report preliminary automatic speech recognition (ASR) results using both the deep neural network hidden Markov model (DNN-HMM) and end-to-end (E2E) architectures.
arXiv Detail & Related papers (2021-07-30T03:39:39Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon
We introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira.
The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries.
Regarding the speech corpus, we designed a sentence collection in which the distribution of di-phones resembles that of real Central Kurdish data.
A test set including 11 different document topics is designed and recorded in two corresponding speech conditions.
arXiv Detail & Related papers (2021-02-15T09:27:54Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under CC0 license and free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.