Related papers: Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition

URL: http://arxiv.org/abs/2601.13044v1
Date: Mon, 19 Jan 2026 13:28:17 GMT
Title: Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition
Authors: Warit Sirichotedumrong, Adisai Na-Thalang, Potsawee Manakul, Pittawat Taveekitworachai, Sittipong Sripaisarnmongkol, Kunat Pipatanakul
Abstract summary: We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. Our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. To address challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled dataset with transcriptions following established Thai linguistic conventions.
Score: 12.692166506908803
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, due to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. We demonstrate that rigorous text normalization can match the impact of model scaling: our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. Our normalization pipeline resolves systemic ambiguities in Thai transcription, including context-dependent number verbalization and repetition markers (mai yamok), creating consistent training targets. We further introduce a two-stage curriculum learning approach for Isan (north-eastern) dialect adaptation that preserves Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled dataset with transcriptions following established Thai linguistic conventions, providing standardized evaluation protocols for the research community.
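To make the repetition-marker ambiguity mentioned in the abstract concrete, here is a minimal Python sketch of one way mai yamok (the Thai character "ๆ", U+0E46) could be expanded so that the transcript spells out what is actually spoken. This is an illustration only, not the paper's pipeline: the function name `expand_mai_yamok` is hypothetical, the input is assumed to be already segmented into words, and the marker is assumed to repeat only the single preceding token (in real text it can repeat a longer phrase).

```python
MAI_YAMOK = "\u0e46"  # Thai repetition marker "ๆ"


def expand_mai_yamok(tokens):
    """Expand repetition markers in an already-segmented list of tokens.

    Assumption: each marker repeats only the token immediately before it.
    """
    out = []
    for tok in tokens:
        # Separate the word from any markers attached to its end, e.g. "เด็กๆ".
        base = tok.rstrip(MAI_YAMOK)
        repeats = len(tok) - len(base)
        if base:
            out.append(base)
        elif not out:
            # A marker with nothing before it cannot be expanded; keep it.
            out.append(tok)
            continue
        # Repeat the most recent word once per marker.
        out.extend([out[-1]] * repeats)
    return out


if __name__ == "__main__":
    print(expand_mai_yamok(["เด็กๆ", "วิ่ง"]))
```

Expanding the marker before training means the written target and the audio contain the same number of spoken words, which is the kind of consistency the abstract attributes to normalization.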

Related papers

Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards [8.109014000578766]
We present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. Our method achieves higher accuracy while maintaining lower latency than existing TTA baselines. Our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.
arXiv Detail & Related papers (2026-03-05T14:43:15Z)
TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition [26.398499487395295]
TG-ASR for Taiwanese Hokkien drama speech recognition uses multilingual translation embeddings to enhance recognition performance. We present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions.
arXiv Detail & Related papers (2026-02-25T15:47:34Z)
A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data [46.73430446242378]
We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. We demonstrate the effectiveness of the framework on Taiwanese Mandarin speech.
arXiv Detail & Related papers (2025-06-10T17:30:32Z)
SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR [58.31068047426522]
Test-Time Adaptation (TTA) aims to mitigate domain mismatch by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. We propose SUTA-LM, a simple yet effective extension of SUTA, with language model rescoring. Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
arXiv Detail & Related papers (2025-06-10T02:50:20Z)
HENT-SRT: Hierarchical Efficient Neural Transducer with Self-Distillation for Joint Speech Recognition and Translation [19.997594859651233]
HENT-SRT is a novel framework that factorizes ASR and translation tasks to better handle reordering. We improve computational efficiency by incorporating best practices from ASR transducers. Our approach is evaluated on three conversational datasets: Arabic, Spanish, and Mandarin.
arXiv Detail & Related papers (2025-06-02T18:37:50Z)
Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization [0.19791587637442667]
Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text. We introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness. Our method achieves accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models on a Vietnamese dataset.
arXiv Detail & Related papers (2025-05-30T05:41:03Z)
Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder. Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework. Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney & JD's joint submission to the IWSLT 2021 low resource speech translation task. We trained our models with the officially provided ASR and MT datasets. To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation [63.16500026845157]
We introduce speech-to-text translation as an auxiliary task to incorporate additional knowledge of the target language. We show that training ST with human translations is not necessary. Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to 8.9% WER reduction to direct transfer.
arXiv Detail & Related papers (2020-06-09T19:34:11Z)
Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech. Experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.