TSPC: A Two-Stage Phoneme-Centric Architecture for Code-Switching Vietnamese-English Speech Recognition
- URL: http://arxiv.org/abs/2509.05983v3
- Date: Sat, 20 Sep 2025 14:15:55 GMT
- Title: TSPC: A Two-Stage Phoneme-Centric Architecture for Code-Switching Vietnamese-English Speech Recognition
- Authors: Minh N. H. Nguyen, Anh Nguyen Tran, Dung Truong Dinh, Nam Van Vo,
- Abstract summary: Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. We propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC).
- Score: 0.855801641444342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. The task is particularly difficult for language pairs like Vietnamese and English, where both distinct phonological features and ambiguity arising from similar-sounding units are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). The TSPC employs a phoneme-centric approach, built upon an extended Vietnamese phoneme set as an intermediate representation to facilitate mixed-lingual modeling. Experimental results demonstrate that TSPC consistently outperforms existing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 19.9% with reduced training resources. Furthermore, the phonetic two-stage architecture enables phoneme adaptation and language conversion to enhance ASR performance in complex Vietnamese-English CS scenarios.
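The two-stage design described in the abstract (speech mapped to phonemes, then phonemes mapped to text) can be sketched as a minimal pipeline. This is an illustrative sketch only, not the paper's implementation: the frame-to-phoneme table and the toy lexicon below are invented for the example, whereas TSPC's actual stages are neural models over an extended Vietnamese phoneme set.

```python
# Minimal sketch of a two-stage phoneme-centric ASR pipeline.
# Stage 1 (speech -> phonemes) is mocked; stage 2 (phonemes -> text)
# is reduced to a greedy lexicon lookup. All names and data here are
# hypothetical, chosen only to show the two-stage decomposition.

def stage1_phoneme_recognizer(audio_frames):
    """Mock acoustic stage: map each 'frame' to a phoneme symbol."""
    frame_to_phoneme = {0: "x", 1: "i", 2: "n", 3: "h", 4: "e", 5: "l", 6: "o"}
    return [frame_to_phoneme[f] for f in audio_frames]

def stage2_phoneme_to_text(phonemes):
    """Mock linguistic stage: greedily match phoneme spans to words
    from a tiny mixed Vietnamese/English lexicon."""
    lexicon = {("x", "i", "n"): "xin", ("h", "e", "l", "o"): "hello"}
    words, i = [], 0
    while i < len(phonemes):
        for span in sorted(lexicon, key=len, reverse=True):
            if tuple(phonemes[i:i + len(span)]) == span:
                words.append(lexicon[span])
                i += len(span)
                break
        else:
            i += 1  # skip a phoneme no lexicon entry covers
    return " ".join(words)

def transcribe(audio_frames):
    """Run both stages end to end."""
    return stage2_phoneme_to_text(stage1_phoneme_recognizer(audio_frames))
```

The separation lets the acoustic stage stay language-mixed (one shared phoneme inventory) while the second stage alone decides which language each span belongs to.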
Related papers
- ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition [7.250850162908686]
We propose ViSpeechFormer (Vietnamese Speech Transformer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias.
arXiv Detail & Related papers (2026-02-10T17:26:55Z) - POWSM: A Phonetic Open Whisper-Style Speech Foundation Model [50.73202227472358]
POWSM is the first unified framework capable of jointly performing multiple phone-related tasks. Our training data, code and models are released to foster open science.
arXiv Detail & Related papers (2025-10-28T21:43:45Z) - Towards Unsupervised Speech Recognition at the Syllable-Level [95.54031547995874]
We introduce a syllable-level UASR framework based on masked language modeling. We generalize effectively to Mandarin, a language that has remained particularly difficult for prior methods.
arXiv Detail & Related papers (2025-10-04T02:56:33Z) - Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English [0.0]
Cross-lingual phoneme recognition has emerged as a significant challenge for accurate automatic speech recognition. English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. We propose a novel bilingual speech recognition approach with two primary contributions.
arXiv Detail & Related papers (2025-08-22T09:10:24Z) - Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies [9.224033819309708]
Code-switching (CS), the alternating use of two or more languages, challenges automatic speech recognition (ASR) due to scarce training data and linguistic similarities. We improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens. Results show that combining a modest amount of synthetic CS data with the dominant-language token yields the best transcription performance.
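The strategy of synthesizing CS transcripts and prefixing a dominant-language token can be illustrated with a toy generator. Everything here, from the token names to the switch probability, is a hypothetical sketch of the general idea, not the paper's recipe:

```python
import random

def make_synthetic_cs(cat_segs, spa_segs, dominant_token="<ca>",
                      p_switch=0.3, seed=0):
    """Interleave monolingual Catalan and Spanish text segments into one
    synthetic code-switched transcript, prefixed with the dominant-language
    token. Segment lists, token names, and the switch probability are
    illustrative only."""
    rng = random.Random(seed)
    cat, spa = list(cat_segs), list(spa_segs)
    out, lang = [], "ca"
    while cat or spa:
        # Draw from the current language's pool; fall back if it is empty.
        pool = cat if (lang == "ca" and cat) or not spa else spa
        out.append(pool.pop(0))
        # Randomly switch language at segment boundaries.
        if rng.random() < p_switch:
            lang = "es" if lang == "ca" else "ca"
    return dominant_token + " " + " ".join(out)
```

Pairing each synthetic utterance with a single dominant-language token (rather than per-word tags) keeps the labeling cheap while still biasing the decoder toward the majority language of the utterance.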
arXiv Detail & Related papers (2025-07-18T12:54:41Z) - LLM-based phoneme-to-grapheme for phoneme-based speech recognition [11.552927239284582]
We propose phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based automatic speech recognition (ASR). Our experimental results show that LLM-P2G outperforms WFST-based systems in cross-lingual ASR for Polish and German, with relative WER reductions of 3.6% and 6.9% respectively.
arXiv Detail & Related papers (2025-06-05T07:35:55Z) - UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation [34.57020177838285]
Cued Speech (CS) enhances lipreading via hand coding, offering visual phonemic cues that support precise speech perception for the hearing-impaired. The task of CS Video-to-Speech generation (CSV2S) aims to convert CS videos into intelligible speech signals. We propose UniCUE, the first CSV2S framework that directly generates speech from CS videos without relying on intermediate text.
arXiv Detail & Related papers (2025-06-04T16:26:49Z) - AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR [1.8533128809847572]
Intra-sentential code-switching is a significant challenge for Automatic Speech Recognition systems. AdaCS is a normalization model that integrates an adaptive bias attention module into an encoder-decoder network. Experiments show that AdaCS outperforms the previous state-of-the-art method on Vietnamese CS ASR normalization.
arXiv Detail & Related papers (2025-01-13T07:27:00Z) - Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
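The splicing idea can be sketched as a crossfaded concatenation of monolingual audio segments. The crossfade length and linear overlap scheme below are assumptions made for illustration; the paper's actual splicing and any energy or prosody smoothing may differ:

```python
def speech_collage(segments, sr=16000, xfade_ms=10):
    """Splice monolingual audio segments (lists of float samples) into one
    code-switched utterance, blending each boundary with a short linear
    crossfade to soften the join. Parameters are purely illustrative."""
    xfade = int(sr * xfade_ms / 1000)  # crossfade length in samples
    out = list(segments[0])
    for seg in segments[1:]:
        seg = list(seg)
        # Fade out the tail of `out` while fading in the head of `seg`.
        mixed = [
            out[len(out) - xfade + i] * (1 - i / (xfade - 1))
            + seg[i] * (i / (xfade - 1))
            for i in range(xfade)
        ]
        out = out[:-xfade] + mixed + seg[xfade:]
    return out
```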
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z) - Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z) - Streaming End-to-End Bilingual ASR Systems with Joint Language Identification [19.09014345299161]
We introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification.
The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India.
arXiv Detail & Related papers (2020-07-08T05:00:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.