Related papers: VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining

URL: http://arxiv.org/abs/2505.21527v2
Date: Thu, 29 May 2025 12:55:12 GMT
Title: VietASR: Achieving Industry-level Vietnamese ASR with 50-hour labeled data and Large-Scale Speech Pretraining
Authors: Jianheng Zhuo, Yifan Yang, Yiwen Shao, Yong Xu, Dong Yu, Kai Yu, Xie Chen
Abstract summary: We propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. We show that pre-training on 70,000-hour unlabeled data and fine-tuning on merely 50-hour labeled data yield a lightweight but powerful ASR model. Our code and models will be open-sourced to facilitate research in low-resource ASR.
Score: 41.555790191562224
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automatic speech recognition (ASR) has made remarkable progress but heavily relies on large-scale labeled data, which is scarce for low-resource languages like Vietnamese. While existing systems such as Whisper, USM, and MMS achieve promising performance, their efficacy remains inadequate in terms of training costs, latency, and accessibility. To address these issues, we propose VietASR, a novel ASR training pipeline that leverages vast amounts of unlabeled data and a small set of labeled data. Through multi-iteration ASR-biased self-supervised learning on a large-scale unlabeled dataset, VietASR offers a cost-effective and practical solution for enhancing ASR performance. Experiments demonstrate that pre-training on 70,000-hour unlabeled data and fine-tuning on merely 50-hour labeled data yield a lightweight but powerful ASR model. It outperforms Whisper Large-v3 and commercial ASR systems on real-world data. Our code and models will be open-sourced to facilitate research in low-resource ASR.

Related papers

Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data [5.324230283177818]
We present a systematic investigation into cross-lingual continuous pretraining for low-resource languages. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline. We employ targeted continual pretraining combined with morphologically-aware tokenization to develop a 300M parameter model that achieves performance comparable to systems 5 times larger.
arXiv Detail & Related papers (2025-12-08T08:16:34Z)
How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu [0.5678475267829229]
Development of Automatic Speech Recognition systems for low-resource African languages remains challenging due to limited transcribed speech data. Recent advances in large multilingual models like OpenAI's Whisper offer promising pathways for low-resource ASR development. We evaluate Whisper's performance through comprehensive experiments on two Bantu languages.
arXiv Detail & Related papers (2025-10-08T16:55:28Z)
ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition [4.0048516930686535]
We present a practical approach to generate AVSR datasets from raw video. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese.
arXiv Detail & Related papers (2025-06-05T05:13:01Z)
Dynamic Data Pruning for Automatic Speech Recognition [58.95758272440217]
We introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers fine-grained pruning granularities specifically tailored for speech-related datasets. Our experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss.
arXiv Detail & Related papers (2024-06-26T14:17:36Z)
Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks. The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments. We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z)
Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models [2.4654745083407175]
We propose a new multi-rounds adaptation process that uses uncertainty to automate the annotation process. This novel method streamlines data annotation and strategically selects data samples contributing most to model uncertainty. Our results show that our approach leads to a 27% WER relative average improvement while requiring on average 45% less data than established baselines.
arXiv Detail & Related papers (2023-06-03T13:11:37Z)
Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions. The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English. For so-called low-resource languages, methods of creating new resources on the basis of existing ones are being investigated. We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
arXiv Detail & Related papers (2022-07-14T12:49:15Z)
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition [126.5605160882849]
We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency. We report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks.
arXiv Detail & Related papers (2021-09-27T17:59:19Z)
SynthASR: Unlocking Synthetic Data for Speech Recognition [15.292920497489925]
We propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to get for ASR model training. In our experiments conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio improved the recognition performance on the new application by more than 65% relative.
arXiv Detail & Related papers (2021-06-14T23:26:44Z)
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR). APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker. We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.