Thai Wav2Vec2.0 with CommonVoice V8
- URL: http://arxiv.org/abs/2208.04799v1
- Date: Tue, 9 Aug 2022 14:21:48 GMT
- Title: Thai Wav2Vec2.0 with CommonVoice V8
- Authors: Wannaphong Phatthiyaphaibun, Chompakorn Chaksangchaichot, Peerat
Limkonchotiwat, Ekapol Chuangsuwanich, Sarana Nutanong
- Abstract summary: Most publicly available Automatic Speech Recognition (ASR) models are in English; only a minority are available in Thai.
Most of the Thai ASR models are closed-sourced, and the performance of existing open-sourced models lacks robustness.
We train a new ASR model on a pre-trained XLSR-Wav2Vec model with the Thai CommonVoice corpus V8 and train a trigram language model to boost the performance of our ASR model.
- Score: 7.818074118880726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Automatic Speech Recognition (ASR), a system that converts audio
into text, has caught a lot of attention in the machine learning community.
Thus, many publicly available models have been released on HuggingFace. However,
most of these ASR models are available in English; only a minority of the
models are available in Thai. Additionally, most of the Thai ASR models are
closed-sourced, and the performance of existing open-sourced models lacks
robustness. To address this problem, we train a new ASR model on a pre-trained
XLSR-Wav2Vec model with the Thai CommonVoice corpus V8 and train a trigram
language model to boost the performance of our ASR model. We hope that our
models will be beneficial to individuals and the ASR community in Thailand.
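The decoding pipeline described above, a CTC acoustic model whose hypotheses are rescored with a trigram language model, can be sketched in miniature. The snippet below is illustrative only: the training sentences, hypotheses, and acoustic scores are hypothetical toy data, whereas the paper trains its trigram model on Thai text and applies it to XLSR-Wav2Vec outputs.

```python
import math
from collections import defaultdict

class TrigramLM:
    """Toy trigram LM with add-one smoothing (hypothetical training data)."""

    def __init__(self, sentences):
        self.counts = defaultdict(int)    # trigram counts
        self.context = defaultdict(int)   # bigram (context) counts
        self.vocab = set()
        for s in sentences:
            toks = ["<s>", "<s>"] + s.split() + ["</s>"]
            self.vocab.update(toks)
            for i in range(2, len(toks)):
                self.counts[tuple(toks[i - 2:i + 1])] += 1
                self.context[tuple(toks[i - 2:i])] += 1

    def log_prob(self, sentence):
        toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        lp, V = 0.0, len(self.vocab)
        for i in range(2, len(toks)):
            c = self.counts[tuple(toks[i - 2:i + 1])]
            ctx = self.context[tuple(toks[i - 2:i])]
            lp += math.log((c + 1) / (ctx + V))  # add-one smoothing
        return lp

def rescore(hypotheses, lm, lm_weight=0.5):
    # hypotheses: list of (text, acoustic_log_prob) from the ASR decoder
    return max(hypotheses,
               key=lambda h: h[1] + lm_weight * lm.log_prob(h[0]))

lm = TrigramLM(["the cat sat", "the cat ran", "a dog sat"])
hyps = [("the cat sat", -4.0), ("the cat sad", -3.9)]
best = rescore(hyps, lm)
```

Here the LM overturns the acoustically preferred but ungrammatical hypothesis, which is exactly the effect an n-gram rescorer is meant to have on CTC output; production systems typically use a KenLM-style model over far more text.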
Related papers
- Qwen3-ASR Technical Report [71.87071808763484]
We introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model.
Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects.
arXiv Detail & Related papers (2026-01-29T06:58:13Z)
- Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages [76.14451035425229]
We introduce Omnilingual ASR, a large-scale automatic speech recognition system.
It scales self-supervised pre-training to 7B parameters to learn robust speech representations.
It expands coverage to over 1,600 languages, including over 500 never before served by ASR.
arXiv Detail & Related papers (2025-11-12T19:48:09Z)
- Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices [1.4625828590961276]
We present a suite of tiny automatic speech recognition (ASR) models specialized for a range of underrepresented languages.
We release Arabic, Chinese, Japanese, Korean, Ukrainian, and Vietnamese models under a permissive open-source license.
arXiv Detail & Related papers (2025-09-02T17:22:54Z)
- Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models [3.2669219874106608]
We evaluate the performance of several automatic speech recognition models on a large-scale Arabic speech dataset.
The dataset contains 668 hours of high-quality audio from Saudi television shows.
We find that the MMS 1B model finetuned on SADA with a 4-gram language model achieves a WER of 40.9% and a CER of 17.6% on the SADA test clean set.
arXiv Detail & Related papers (2025-08-18T14:44:25Z)
- Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic [15.807843278492847]
We introduce a universal methodology for Arabic speech and text processing designed to address unique challenges of the language.
We train two novel models based on the FastConformer architecture: one designed specifically for Modern Standard Arabic (MSA) and the other, the first unified public model for both MSA and Classical Arabic (CA).
The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA.
arXiv Detail & Related papers (2025-07-18T14:42:18Z)
- Efficient Multilingual ASR Finetuning via LoRA Language Experts [59.27778147311189]
This paper proposes an efficient finetuning framework for customized multilingual ASR via prepared LoRA language experts based on Whisper.
Through LoRA expert fusion or knowledge distillation, our approach achieves better recognition performance on target languages than standard fine-tuning methods.
Experimental results demonstrate that the proposed models yield approximately 10% and 15% relative performance gains in language-aware and language-agnostic scenarios.
arXiv Detail & Related papers (2025-06-11T07:06:27Z)
- KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [57.08591486199925]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track.
We develop both cascaded systems and end-to-end (E2E) Speech Translation systems.
Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1% on two benchmarks, and word error rates that are 3.3% and 2.0% lower on the same benchmarks.
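A re-ranking scheme like the one summarized above can be sketched as a weighted linear combination of per-hypothesis scores. The hypothesis texts, score values, and the `lid_score` feature below are hypothetical illustrations of the general idea, not the paper's actual features or weights:

```python
def rerank(nbest, lid_weight=1.0):
    """Re-rank an N-best list by a weighted sum of the ASR decoder's
    log-probability and an external language-identification (LID) log-score.
    Returns hypotheses sorted best-first."""
    return sorted(
        nbest,
        key=lambda h: h["asr_score"] + lid_weight * h["lid_score"],
        reverse=True,
    )

# Hypothetical N-best list: the second hypothesis has a slightly better
# acoustic score, but the LID feature strongly prefers the first.
nbest = [
    {"text": "hola mundo",  "asr_score": -2.1, "lid_score": -0.1},
    {"text": "holla mondo", "asr_score": -2.0, "lid_score": -1.5},
]
best = rerank(nbest)[0]
```

The appeal of this style of re-ranking is that it needs no retraining of the ASR model: any external signal that can score a hypothesis string can be folded into the ranking function.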
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
- Self-supervised Speech Representations Still Struggle with African American Vernacular English [28.223877889211803]
Underperformance of ASR systems for speakers of marginalized language varieties is a well-documented phenomenon.
We investigate whether or not the recent wave of Self-Supervised Learning speech models can close the gap in ASR performance between AAVE and Mainstream American English.
arXiv Detail & Related papers (2024-08-26T13:29:25Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Adapting Multi-Lingual ASR Models for Handling Multiple Talkers [63.151811561972515]
State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages.
We propose an approach to adapt USMs for multi-talker ASR.
We first develop an enhanced version of serialized output training to jointly perform multi-talker ASR and utterance timestamp prediction.
arXiv Detail & Related papers (2023-05-30T05:05:52Z)
- Iteratively Improving Speech Recognition and Voice Conversion [10.514009693947227]
We first train an ASR model which is used to ensure content preservation while training a VC model.
In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers.
By iteratively leveraging the improved ASR model to train VC model and vice-versa, we experimentally show improvement in both the models.
arXiv Detail & Related papers (2023-05-24T11:45:42Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Bangla-Wave: Improving Bangla Automatic Speech Recognition Utilizing N-gram Language Models [0.0]
We show how to significantly improve the performance of an ASR model by adding an n-gram language model as a post-processor.
We generate a robust Bangla ASR model that is better than the existing ASR models.
arXiv Detail & Related papers (2022-09-13T17:59:21Z)
- Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English.
For so-called low-resource languages, methods of creating new resources on the basis of existing ones are being investigated.
We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
arXiv Detail & Related papers (2022-07-14T12:49:15Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
- Phone Features Improve Speech Translation [69.54616570679343]
End-to-end models for speech translation (ST) more tightly couple speech recognition (ASR) and machine translation (MT).
We compare cascaded and end-to-end models across high, medium, and low-resource conditions, and show that cascades remain stronger baselines.
We show that these features improve both architectures, closing the gap between end-to-end models and cascades, and outperforming previous academic work -- by up to 9 BLEU on our low-resource setting.
arXiv Detail & Related papers (2020-05-27T22:05:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.