Related papers: CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition

URL: http://arxiv.org/abs/2511.06860v1
Date: Mon, 10 Nov 2025 09:03:30 GMT
Title: CLiFT-ASR: A Cross-Lingual Fine-Tuning Framework for Low-Resource Taiwanese Hokkien Speech Recognition
Authors: Hung-Yang Sung, Chien-Chun Wang, Kuan-Tang Huang, Tien-Hong Lo, Yu-Sheng Tsao, Yung-Chang Hsu, Berlin Chen
Abstract summary: CLiFT-ASR is a cross-lingual fine-tuning framework for speech recognition in Taiwanese Hokkien. It first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate.
Score: 12.323666705980672
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic speech recognition (ASR) for low-resource languages such as Taiwanese Hokkien is difficult due to the scarcity of annotated data. However, direct fine-tuning on Han-character transcriptions often fails to capture detailed phonetic and tonal cues, while training only on romanization lacks lexical and syntactic coverage. In addition, prior studies have rarely explored staged strategies that integrate both annotation types. To address this gap, we present CLiFT-ASR, a cross-lingual fine-tuning framework that builds on Mandarin HuBERT models and progressively adapts them to Taiwanese Hokkien. The framework employs a two-stage process in which it first learns acoustic and tonal representations from phonetic Tai-lo annotations and then captures vocabulary and syntax from Han-character transcriptions. This progressive adaptation enables effective alignment between speech sounds and orthographic structures. Experiments on the TAT-MOE corpus demonstrate that CLiFT-ASR achieves a 24.88% relative reduction in character error rate (CER) compared with strong baselines. The results indicate that CLiFT-ASR provides an effective and parameter-efficient solution for Taiwanese Hokkien ASR and that it has potential to benefit other low-resource language scenarios.
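
The abstract reports its result as a relative reduction in character error rate. As a rough illustration of what that metric means, the sketch below computes CER as character-level edit distance divided by reference length, and the relative reduction between two CER values. The function names, the sample strings, and the 0.20 and 0.15 CER values are all made up for illustration; they are not taken from the paper, whose absolute CER figures are not given here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters here)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, new_cer):
    """Fraction by which new_cer improves on baseline_cer."""
    return (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    # Invented example: one substituted character out of three.
    print(f"CER: {cer('食飽未', '食飯未'):.4f}")
    # Invented CER values, chosen only to show the arithmetic.
    print(f"Relative reduction: {relative_reduction(0.20, 0.15):.2%}")
```

Note that a relative reduction is measured against the baseline's own error rate, so it does not by itself say how large the absolute CER difference is.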

Related papers

Linguistically Informed Graph Model and Semantic Contrastive Learning for Korean Short Text Classification [2.4071330817126477]
We propose LIGRAM, a hierarchical heterogeneous graph model for Korean short-text classification. The proposed model constructs sub-graphs at the morpheme, part-of-speech, and named-entity levels and hierarchically integrates them to compensate for the limited contextual information in short texts. We evaluate LIGRAM on four Korean short-text datasets, where it consistently outperforms existing baseline models.
arXiv Detail & Related papers (2026-03-04T02:17:13Z)
TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition [26.398499487395295]
TG-ASR for Taiwanese Hokkien drama speech recognition uses multilingual translation embeddings to enhance recognition performance. We present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions.
arXiv Detail & Related papers (2026-02-25T15:47:34Z)
SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages [11.655315357810371]
SITA is a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language where off-the-shelf multilingual encoders fail to represent tone effectively.
arXiv Detail & Related papers (2026-01-14T00:42:27Z)
WESR: Scaling and Evaluating Word-level Event-Speech Recognition [59.21814194620928]
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
arXiv Detail & Related papers (2026-01-08T02:23:21Z)
Towards Unsupervised Speech Recognition at the Syllable-Level [95.54031547995874]
We introduce a syllable-level UASR framework based on masked language modeling. We generalize effectively to Mandarin, a language that has remained particularly difficult for prior methods.
arXiv Detail & Related papers (2025-10-04T02:56:33Z)
Building Tailored Speech Recognizers for Japanese Speaking Assessment [6.152272170188488]
We build a speech recognizer that outputs phonemic labels with accent markers. Although Japanese is resource-rich, there is only a small amount of data for training models to produce accurate phonemic transcriptions.
arXiv Detail & Related papers (2025-09-25T01:26:11Z)
TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition [0.855801641444342]
Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. We propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC).
arXiv Detail & Related papers (2025-09-07T09:19:03Z)
Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z)
Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z)
Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role. Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
Dual-Decoder Transformer For end-to-end Mandarin Chinese Speech Recognition with Pinyin and Character [15.999657143705045]
Characters and pinyin, as the writing and spelling systems of Mandarin Chinese respectively, reinforce each other. We propose a novel Mandarin Chinese ASR model with a dual-decoder Transformer designed around the characteristics of pinyin transcripts and character transcripts. On the AISHELL-1 test set, the proposed Speech-Pinyin-Character-Interaction (SPCI) model achieves a 9.85% character error rate (CER) without a language model.
arXiv Detail & Related papers (2022-01-26T07:59:03Z)
Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method. We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.