Jira: a Kurdish Speech Recognition System Designing and Building Speech
Corpus and Pronunciation Lexicon
- URL: http://arxiv.org/abs/2102.07412v1
- Date: Mon, 15 Feb 2021 09:27:54 GMT
- Authors: Hadi Veisi, Hawre Hosseini, Mohammad Mohammadamini (LIA), Wirya Fathy,
Aso Mahmudi
- Abstract summary: We introduce the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira.
The Kurdish language is an Indo-European language spoken by more than 30 million people in several countries.
Regarding the speech corpus, we designed a sentence collection whose di-phone distribution resembles that of real Central Kurdish data.
A test set including 11 different document topics is designed and recorded in two corresponding speech conditions.
- Score: 4.226093500082746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce the first large vocabulary speech
recognition system (LVSR) for the Central Kurdish language, named Jira. The
Kurdish language is an Indo-European language spoken by more than 30 million
people in several countries, but due to the lack of speech and text
resources, there is no speech recognition system for this language. To fill
this gap, we introduce the first speech corpus and pronunciation lexicon for
the Kurdish language. For the speech corpus, we designed a sentence
collection whose di-phone distribution resembles that of real Central
Kurdish data (a selection sketch is given after the abstract). The designed
sentences are uttered by 576 speakers, both in a controlled environment with
noise-free microphones (called AsoSoft Speech-Office) and in the Telegram
social network environment using mobile phones (denoted AsoSoft
Speech-Crowdsourcing), resulting in 43.68 hours of speech. In addition, a
test set including 11 different document topics is designed and recorded in
the two corresponding speech conditions (i.e., Office and Crowdsourcing).
Furthermore, a 60K pronunciation lexicon is prepared in this research, a
task that posed several challenges for which we propose solutions. The
Kurdish language has several dialects and sub-dialects, which results in
many lexical variations. Our methods for script standardization of lexical
variations (see the normalization sketch below) and automatic pronunciation
of the lexicon tokens are presented in detail. To set up the recognition
engine, we used the Kaldi toolkit. A statistical tri-gram language model
extracted from the AsoSoft text corpus is used in the system. Several
standard recipes, including HMM-based models (i.e., mono, tri1, tri2, tri3),
SGMM, and DNN methods, are used to generate the acoustic models. These
models are trained on AsoSoft Speech-Office, AsoSoft Speech-Crowdsourcing,
and their combination. The best performance is achieved by the SGMM acoustic
model, which yields an average word error rate of 13.9% across document
topics and 4.9% on the general topic (the WER metric is sketched below).
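As a concrete illustration of the di-phone-balanced sentence selection
described in the abstract, the sketch below greedily picks sentences whose
pooled di-phone distribution moves closest to a target distribution estimated
from a large text corpus. This is a minimal sketch of one common approach,
not the paper's exact procedure; the candidate data, the target distribution,
and the phonemization step are hypothetical stand-ins.

```python
from collections import Counter

def diphones(phones):
    """Adjacent phone pairs of a phonemized sentence."""
    return list(zip(phones, phones[1:]))

def l1_distance(counts, target):
    """L1 distance between normalized counts and a target distribution."""
    total = sum(counts.values()) or 1
    keys = set(counts) | set(target)
    return sum(abs(counts[d] / total - target.get(d, 0.0)) for d in keys)

def select_balanced(candidates, target, k):
    """Greedily pick k sentences (given as phone lists), each step adding
    the one that brings the pooled di-phone distribution closest to the
    corpus-wide target distribution."""
    pool = list(candidates)
    selected, pooled = [], Counter()
    for _ in range(k):
        best = min(pool,
                   key=lambda s: l1_distance(pooled + Counter(diphones(s)),
                                             target))
        pool.remove(best)
        pooled += Counter(diphones(best))
        selected.append(best)
    return selected

# Hypothetical usage: `target` is the di-phone relative-frequency dict of a
# large Central Kurdish text corpus, `candidates` are phonemized sentences.
# chosen = select_balanced(candidates, target, k=500)
```

A greedy pass like this (or a search method such as the genetic algorithm
used by BASPRO, listed under related papers below) trades off coverage of
rare di-phones against matching the natural distribution.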
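Script standardization is needed because Central Kurdish text in the wild
mixes Arabic, Persian, and Kurdish code points for visually similar letters.
The toy normalizer below maps two well-known variant characters and strips
tatweel and Arabic diacritics; it is an illustrative assumption, not the
paper's rule set, which also has to resolve dialectal lexical variations.

```python
import re

# Map common Arabic-script variant code points to Central Kurdish forms.
# Illustrative subset only; real standardization rules are far richer.
VARIANT_MAP = str.maketrans({
    "\u0643": "\u06A9",  # Arabic kaf ك  -> keheh ک
    "\u064A": "\u06CC",  # Arabic yeh ي  -> Farsi yeh ی
})

# Tatweel and Arabic diacritics carry no information for lexicon lookup.
STRIP = re.compile("[\u0640\u064B-\u0652]")

def standardize(text: str) -> str:
    return STRIP.sub("", text).translate(VARIANT_MAP)

# standardize("كوردي") -> "کوردی"  (same word, standardized code points)
```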
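The reported numbers are word error rates: the word-level Levenshtein
distance between the reference and hypothesis transcripts, divided by the
reference length. A minimal, standard implementation for reference (not
paper-specific code):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# wer("one two three", "one three four") == 2 / 3
```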
Related papers
- Language and Speech Technology for Central Kurdish Varieties [27.751434601712]
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum.
Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language.
In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish.
arXiv Detail & Related papers (2024-03-04T12:27:32Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST).
We use discretized speech units, which are generated in a fully unsupervised way.
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z)
- Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling [92.55131711064935]
We propose a cross-lingual neural language model, VALL-E X, for cross-lingual speech synthesis.
VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks.
It can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment.
arXiv Detail & Related papers (2023-03-07T14:31:55Z)
- BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm [29.701197643765674]
The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation.
We propose BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences.
arXiv Detail & Related papers (2022-12-11T02:05:30Z)
- ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition systems for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- The Interspeech Zero Resource Speech Challenge 2021: Spoken language modelling [19.525392906001624]
We present the Zero Resource Speech Challenge 2021, which asks participants to learn a language model directly from audio, without any text or labels.
The challenge is based on the Libri-light dataset, which provides up to 60k hours of audio from English audio books without any associated text.
arXiv Detail & Related papers (2021-04-29T23:53:37Z)
- A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline [4.521450956414864]
The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups.
The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications.
arXiv Detail & Related papers (2020-09-22T05:57:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.