Exploration of End-to-End ASR for OpenSTT -- Russian Open Speech-to-Text
Dataset
- URL: http://arxiv.org/abs/2006.08274v2
- Date: Sun, 26 Jul 2020 20:21:09 GMT
- Title: Exploration of End-to-End ASR for OpenSTT -- Russian Open Speech-to-Text
Dataset
- Authors: Andrei Andrusenko, Aleksandr Laptev, Ivan Medennikov
- Abstract summary: This paper presents an exploration of end-to-end automatic speech recognition (ASR) systems for the largest open-source Russian language data set -- OpenSTT.
We evaluate different existing end-to-end approaches such as joint CTC/Attention, RNN-Transducer, and Transformer.
For the three available validation sets (phone calls, YouTube, and books), our best end-to-end model achieves word error rates (WER) of 34.8%, 19.1%, and 18.1%, respectively.
- Score: 73.66530509749305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents an exploration of end-to-end automatic speech
recognition (ASR) systems for the largest open-source Russian language data
set -- OpenSTT. We evaluate different existing end-to-end approaches such as
joint CTC/Attention, RNN-Transducer, and Transformer. All of them are compared
with a strong hybrid ASR system based on an LF-MMI TDNN-F acoustic model. For
the three available validation sets (phone calls, YouTube, and books), our
best end-to-end model achieves word error rates (WER) of 34.8%, 19.1%, and
18.1%, respectively. Under the same conditions, the hybrid ASR system
demonstrates 33.5%, 20.9%, and 18.6% WER.
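The joint CTC/Attention approach evaluated in the paper interpolates a CTC loss on the encoder outputs with a cross-entropy loss from the attention decoder. A minimal PyTorch sketch of that interpolation on dummy tensors follows; the 0.3 CTC weight is an assumed, ESPnet-style value, not one reported here.

```python
import torch
import torch.nn.functional as F

B, T, U, V = 2, 50, 10, 30  # batch, encoder frames, target length, vocab size

# Dummy encoder log-probabilities for the CTC branch: (T, B, V).
ctc_log_probs = F.log_softmax(torch.randn(T, B, V), dim=-1)
# Dummy attention-decoder logits: (B, U, V).
att_logits = torch.randn(B, U, V)

targets = torch.randint(1, V, (B, U))  # label 0 is reserved for the CTC blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), U, dtype=torch.long)

ctc_loss = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths, blank=0)
att_loss = F.cross_entropy(att_logits.reshape(-1, V), targets.reshape(-1))

lam = 0.3  # CTC weight: an assumed, ESPnet-style value, not from the paper
loss = lam * ctc_loss + (1.0 - lam) * att_loss
```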
Related papers
- One Whisper to Grade Them All [10.035434464829958]
We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems.
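As an illustration of the single-encoder idea, here is a hedged sketch of encoding several responses with one Whisper-small encoder through the Hugging Face transformers API; the pooling step is an assumption, not the paper's actual scoring pipeline.

```python
import torch
from transformers import WhisperModel

# One shared encoder for all responses (downloads openai/whisper-small).
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder

# Dummy log-mel features for four responses; Whisper expects (batch, 80, 3000).
features = torch.randn(4, 80, 3000)

with torch.no_grad():
    hidden = encoder(features).last_hidden_state  # (4, 1500, 768)

# A scoring head would pool these per response; mean-pooling is a
# hypothetical choice, not taken from the paper.
pooled = hidden.mean(dim=1)  # (4, 768)
```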
arXiv Detail & Related papers (2025-07-23T20:31:40Z)
- A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data [46.73430446242378]
We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. We demonstrate the effectiveness of the framework on Taiwanese Mandarin speech.
arXiv Detail & Related papers (2025-06-10T17:30:32Z)
- KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [57.08591486199925]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded and end-to-end (E2E) speech translation systems. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
- An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR [12.197936305117407]
Augmenting the training data of automatic speech recognition systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years.
We leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models.
arXiv Detail & Related papers (2025-03-11T23:09:06Z)
- Whisper Finetuning on Nepali Language [0.0]
This research focuses on making an exhaustive and generalized dataset followed by fine-tuning OpenAI's Whisper models to improve transcription accuracy for the Nepali language.
We leverage publicly available ASR datasets and self-recorded custom datasets with a diverse range of accents, dialects, and speaking styles further enriched through augmentation.
Our approach outperforms Whisper's baseline models trained on the FLEURS dataset, achieving WER reductions of up to 36.2% on the small model and 23.8% on the medium model.
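Assuming these figures are relative reductions, the arithmetic behind such a number is simply:

```python
# Relative WER reduction: (baseline - finetuned) / baseline.
# The baseline value below is made up purely for illustration.
baseline_wer, finetuned_wer = 50.0, 31.9
reduction = (baseline_wer - finetuned_wer) / baseline_wer
print(f"{100 * reduction:.1f}% relative WER reduction")  # -> 36.2%
```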
arXiv Detail & Related papers (2024-11-19T15:55:56Z)
- Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition [71.87998918300806]
This paper explores approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems.
TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models consistently outperform standalone fine-tuned SSL pre-trained models.
Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
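A minimal sketch of extracting the kind of frame-level SSL features such systems consume, using a generic pre-trained wav2vec 2.0 checkpoint from Hugging Face as an assumed stand-in for the paper's domain-adapted models:

```python
import torch
from transformers import Wav2Vec2Model

# Pre-trained SSL model as a stand-in for the paper's domain-adapted variants.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio

with torch.no_grad():
    feats = model(waveform).last_hidden_state  # (1, ~49, 768), ~20 ms frames

# These frame-level features would then feed a TDNN or Conformer system.
```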
arXiv Detail & Related papers (2024-07-03T08:33:39Z)
- Automatic Speech Recognition Advancements for Indigenous Languages of the Americas [0.0]
The Second AmericasNLP (Americas Natural Language Processing) Competition, Track 1 of NeurIPS (Neural Information Processing Systems) 2022, proposed the task of training automatic speech recognition systems for five Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana.
We describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods.
We release our best models for each language, marking the first open ASR models for Wa'ikhana and Kotiria.
arXiv Detail & Related papers (2024-04-12T10:12:38Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio and visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
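A compact sketch of one cross-attention fusion step between audio and visual encoder streams, with dummy dimensions; it illustrates the mechanism rather than the paper's exact fusion layers:

```python
import torch
from torch import nn

# Dummy encoder outputs: (batch, time, dim); audio has the higher frame rate.
audio = torch.randn(2, 100, 256)
visual = torch.randn(2, 25, 256)

cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

# Audio queries attend over visual keys/values; a residual keeps the stream.
fused, _ = cross_attn(query=audio, key=visual, value=visual)
audio = audio + fused

# Repeating this at several encoder layers yields the multi-layer fusion idea.
```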
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Data Augmentation for End-to-end Code-switching Speech Recognition [54.0507000473827]
Three novel approaches are proposed for code-switching data augmentation: audio splicing with the existing code-switching data, and TTS with new code-switching texts generated by word translation or word insertion.
Experiments on a 200-hour Mandarin-English code-switching dataset show that each approach individually yields significant improvements in code-switching ASR.
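The audio-splicing idea can be illustrated in a few lines of NumPy: concatenating segments of existing recordings into new code-switched training utterances (a simplified sketch; the paper's segment selection is more involved):

```python
import numpy as np

sr = 16000  # sample rate in Hz

# Dummy waveforms standing in for real recordings (1 s Mandarin, 0.5 s English).
mandarin_seg = np.random.randn(sr).astype(np.float32)
english_seg = np.random.randn(sr // 2).astype(np.float32)

# Splice segments back-to-back into a new code-switched training utterance;
# its transcript is the concatenation of the segment transcripts.
spliced = np.concatenate([mandarin_seg, english_seg, mandarin_seg])
```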
arXiv Detail & Related papers (2020-11-04T07:12:44Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Jointly Trained Transformers models for Spoken Language Translation [2.3886615435250302]
This work trains SLT systems with an ASR objective as an auxiliary loss, connecting the two networks through neural hidden representations.
This architecture improves the BLEU score from 36.8 to 44.5.
All the experiments are reported on English-Portuguese speech translation task using How2 corpus.
arXiv Detail & Related papers (2020-04-25T11:28:39Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on the RNN-Transducer together with an improved beam search, reaches a quality only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
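The RNN-Transducer branch of this comparison can be exercised with torchaudio's transducer loss on dummy tensors; this sketches the loss only, not the paper's full model or its improved beam search:

```python
import torch
import torchaudio

B, T, U, V = 2, 50, 10, 30  # batch, encoder frames, target length, vocab size

# Joint-network logits over the (T, U+1) alignment lattice: (B, T, U+1, V).
logits = torch.randn(B, T, U + 1, V)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)  # 0 is the blank
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0
)
```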
arXiv Detail & Related papers (2020-04-22T19:08:33Z)