USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition
Experiments
- URL: http://arxiv.org/abs/2107.14419v1
- Date: Fri, 30 Jul 2021 03:39:39 GMT
- Title: USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition
Experiments
- Authors: Muhammadjon Musaev, Saida Mussakhojayeva, Ilyos Khujayorov, Yerbolat
Khassanov, Mannon Ochilov, Huseyin Atakan Varol
- Abstract summary: We present a freely available speech corpus for the Uzbek language.
We report preliminary automatic speech recognition (ASR) results using both the deep neural network hidden Markov model (DNN-HMM) and end-to-end (E2E) architectures.
- Score: 3.8673738158945326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a freely available speech corpus for the Uzbek language and report
preliminary automatic speech recognition (ASR) results using both the deep
neural network hidden Markov model (DNN-HMM) and end-to-end (E2E)
architectures. The Uzbek speech corpus (USC) comprises 958 different speakers
with a total of 105 hours of transcribed audio recordings. To the best of our
knowledge, this is the first open-source Uzbek speech corpus dedicated to the
ASR task. To ensure high quality, the USC has been manually checked by native
speakers. We first describe the design and development procedures of the USC,
and then explain the conducted ASR experiments in detail. The experimental
results are promising and demonstrate the applicability of the USC for ASR.
Specifically, 18.1% and 17.4% word error rates were achieved on the validation
and test sets, respectively. To enable experiment reproducibility, we share the
USC dataset, pre-trained models, and training recipes in our GitHub repository.
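The reported 18.1% and 17.4% figures are word error rates. For readers unfamiliar with the metric, here is a minimal sketch of how WER is computed as a word-level Levenshtein edit distance; this is a generic illustration with made-up Uzbek-like strings, not code from the USC recipes.

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of three reference words -> WER of 1/3.
print(wer("salom dunyo qalaysan", "salom dunye qalaysan"))
```

In practice ASR toolkits report the same quantity aggregated over a whole test set (total errors divided by total reference words), rather than averaging per-utterance WERs.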
Related papers
- One model to rule them all ? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization [61.60501633397704]
We investigate the emergent abilities of the recently proposed web-scale speech model Whisper, by adapting it to unseen tasks with prompt engineering.
We design task-specific prompts, by either leveraging another large-scale model, or simply manipulating the special tokens in the default prompts.
Experiments show that our proposed prompts improve performance by 10% to 45% on the three zero-shot tasks, and even outperform SotA supervised models on some datasets.
arXiv Detail & Related papers (2023-05-18T16:32:58Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- TALCS: An Open-Source Mandarin-English Code-Switching Corpus and a Speech Recognition Baseline [0.0]
This paper introduces TALCS, a new corpus for Mandarin-English code-switching speech recognition.
The TALCS corpus is derived from real online one-to-one English teaching scenes in the TAL education group.
To the best of our knowledge, the TALCS corpus is the largest well-labeled open-source Mandarin-English code-switching ASR dataset in the world.
arXiv Detail & Related papers (2022-06-27T09:30:25Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus [11.113497373432411]
We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain.
This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16 kHz, crawled from the Aljazeera news channel.
arXiv Detail & Related papers (2021-06-24T13:20:40Z)
- A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline [4.521450956414864]
The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups.
The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications.
arXiv Detail & Related papers (2020-09-22T05:57:15Z)
- KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition [1.7955614278088239]
KoSpeech is an end-to-end Korean automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch.
We propose preprocessing methods for KsponSpeech corpus and a baseline model for benchmarks.
Our baseline model achieved 10.31% character error rate (CER) at KsponSpeech corpus only with the acoustic model.
arXiv Detail & Related papers (2020-09-07T13:25:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.