TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
- URL: http://arxiv.org/abs/2602.22039v1
- Date: Wed, 25 Feb 2026 15:47:34 GMT
- Title: TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
- Authors: Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen,
- Abstract summary: TG-ASR for Taiwanese Hokkien drama speech recognition uses multilingual translation embeddings to enhance recognition performance.<n>We present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions.
- Score: 26.398499487395295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien exemplifies this issue, with transcriptions often being scarce and the majority of available subtitles provided only in Mandarin. To address this deficiency, we introduce TG-ASR for Taiwanese Hokkien drama speech recognition, a translation-guided ASR framework that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered around the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research initiatives, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.
Related papers
- Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition [12.692166506908803]
We present Typhoon ASR Real-time, a 115M- parameter FastConformer-Transducer model for low-latency Thai speech recognition.<n>Our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy.<n>To address challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled datasets with transcriptions following established Thai linguistic conventions.
arXiv Detail & Related papers (2026-01-19T13:28:17Z) - SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages [11.655315357810371]
SITA is a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders.<n>We evaluate primarily on Hmong, a highly tonal and severely under-the-shelf multilingual encoders fail to represent tone effectively.
arXiv Detail & Related papers (2026-01-14T00:42:27Z) - CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition [12.323666705980672]
CLiFT-ASR is a cross-lingual fine-tuning framework for speech recognition in Taiwanese Hokkien.<n>It first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions.<n>Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate.
arXiv Detail & Related papers (2025-11-10T09:03:30Z) - Towards Unsupervised Speech Recognition at the Syllable-Level [95.54031547995874]
We introduce a syllable-level UASR framework based on masked language modeling.<n>We generalize effectively to Mandarin, a language that has remained particularly difficult for prior methods.
arXiv Detail & Related papers (2025-10-04T02:56:33Z) - Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR [15.311893064721858]
This study proposes a novel approach to using TV subtitles within a weakly supervised (WS) Automatic Speech Recognition (ASR) framework.<n>Rather than using subtitles as direct supervision signals, our method reimagines them as context-rich prompts.<n> generated pseudo transcripts become the primary targets, with subtitles acting as guiding cues for iterative refinement.
arXiv Detail & Related papers (2025-09-01T11:43:07Z) - Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation.<n>We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z) - Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings [0.0]
We propose WSI (Whisper Speaker Identification), a framework that repurposes the Whisper automatic speech recognition model pre trained on extensive multilingual data.<n>By capitalizing on Whisper language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages.
arXiv Detail & Related papers (2025-03-13T15:11:28Z) - Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively.<n>However, Whisper struggles with unseen languages, those not included in its pre-training.<n>We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and
Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Cross-lingual Transfer for Speech Processing using Acoustic Language
Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-source languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC)
LBMRC trains multiple machine reading comprehension (MRC) models proficient in individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - How Phonotactics Affect Multilingual and Zero-shot ASR Performance [74.70048598292583]
A Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training.
We replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM.
We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer.
arXiv Detail & Related papers (2020-10-22T23:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.