Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English
- URL: http://arxiv.org/abs/2508.19270v1
- Date: Fri, 22 Aug 2025 09:10:24 GMT
- Title: Whisper based Cross-Lingual Phoneme Recognition between Vietnamese and English
- Authors: Nguyen Huu Nhat Minh, Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam, Le Pham Tuyen
- Abstract summary: Cross-lingual phoneme recognition has emerged as a significant challenge for accurate automatic speech recognition. English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. We propose a novel bilingual speech recognition approach with two primary contributions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual phoneme recognition has emerged as a significant challenge for accurate automatic speech recognition (ASR) when mixing Vietnamese and English pronunciations. Unlike many languages, Vietnamese relies on tonal variations to distinguish word meanings, whereas English features stress patterns and non-standard pronunciations that hinder phoneme alignment between the two languages. To address this challenge, we propose a novel bilingual speech recognition approach with two primary contributions: (1) constructing a representative bilingual phoneme set that bridges the differences between Vietnamese and English phonetic systems; (2) designing an end-to-end system that leverages the PhoWhisper pre-trained encoder for deep high-level representations to improve phoneme recognition. Our extensive experiments demonstrate that the proposed approach not only improves recognition accuracy in bilingual speech recognition for Vietnamese but also provides a robust framework for addressing the complexities of tonal and stress-based phoneme recognition.
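The abstract describes frame-level representations from a pre-trained encoder feeding a phoneme recognizer over a shared bilingual phoneme set. A minimal sketch (pure Python, toy scores) of the decoding step such an end-to-end system typically needs: per-frame phoneme scores are collapsed by a greedy CTC-style decode. The phoneme labels here (including the tonal vowel label "a1") are hypothetical, not the paper's actual bilingual set.

```python
# Greedy CTC-style decoding over per-frame phoneme scores:
# argmax each frame, collapse repeats, and drop the blank symbol.
BLANK = "_"

def greedy_ctc_decode(frame_scores):
    """frame_scores: list of {phoneme: score} dicts, one per frame."""
    best = [max(f, key=f.get) for f in frame_scores]   # argmax per frame
    out, prev = [], None
    for sym in best:
        if sym != prev and sym != BLANK:               # collapse repeats, skip blanks
            out.append(sym)
        prev = sym
    return out

frames = [
    {"_": 0.1, "t": 0.8, "a1": 0.1},   # "a1": hypothetical tonal vowel label
    {"_": 0.2, "t": 0.7, "a1": 0.1},
    {"_": 0.9, "t": 0.05, "a1": 0.05},
    {"_": 0.1, "t": 0.1, "a1": 0.8},
]
print(greedy_ctc_decode(frames))  # ['t', 'a1']
```

In the actual system the per-frame scores would come from a classifier head on top of the PhoWhisper encoder's output rather than hand-written dicts.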
Related papers
- ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition [7.250850162908686]
We propose ViSpeechFormer (Vietnamese Speech TransFormer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias.
arXiv Detail & Related papers (2026-02-10T17:26:55Z) - Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition [58.74986434825755]
Cross-lingual speech emotion recognition is a challenging task due to differences in phonetic variability and speaker-specific expressive styles. We propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits.
arXiv Detail & Related papers (2025-09-19T21:03:21Z) - TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition [0.855801641444342]
Code-switching (CS) presents a significant challenge for general Automatic Speech Recognition (ASR) systems. Existing methods often fail to capture the subtle phonological shifts inherent in CS scenarios. We propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC).
arXiv Detail & Related papers (2025-09-07T09:19:03Z) - Data-Driven Mispronunciation Pattern Discovery for Robust Speech Recognition [1.0323063834827417]
We propose two data-driven approaches to automatically detect mispronunciation patterns. By aligning non-native phones with their native counterparts using attention maps, we achieved a 5.7% improvement in speech recognition on native English datasets.
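The alignment idea above can be illustrated with toy numbers: an attention map between non-native phones and native phones is reduced by row-wise argmax to pair each non-native phone with its closest native counterpart. The labels and weights below are hypothetical, not taken from the paper.

```python
# Pair each non-native phone with the native phone it attends to most.
native = ["p", "t", "k"]
nonnative = ["ph", "th"]
# attention[i][j]: attention weight from non-native phone i to native phone j
attention = [
    [0.7, 0.2, 0.1],   # "ph" attends mostly to "p"
    [0.1, 0.8, 0.1],   # "th" attends mostly to "t"
]

pairs = {
    nonnative[i]: native[max(range(len(native)), key=row.__getitem__)]
    for i, row in enumerate(attention)
}
print(pairs)  # {'ph': 'p', 'th': 't'}
```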
arXiv Detail & Related papers (2025-02-01T22:41:43Z) - Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation [9.118302330129284]
This research optimizes two-pass cross-lingual transfer learning in low-resource languages.
We optimize phoneme vocabulary coverage by merging phonemes based on shared articulatory characteristics.
We introduce a global phoneme noise generator for realistic ASR noise during phoneme-to-grapheme training to reduce error propagation.
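The two techniques this entry names can each be sketched in a few lines. Below, a hypothetical articulatory feature table is used to merge phonemes that agree once voicing is ignored, and a toy noise generator randomly substitutes phonemes to mimic ASR errors during phoneme-to-grapheme training. The feature table, merge rule, and noise rate are all illustrative assumptions, not the paper's actual choices.

```python
import random

# Hypothetical articulatory feature table (place, manner, voicing).
features = {
    "p": ("bilabial", "plosive", "voiceless"),
    "b": ("bilabial", "plosive", "voiced"),
    "t": ("alveolar", "plosive", "voiceless"),
    "d": ("alveolar", "plosive", "voiced"),
}

def merge_key(phoneme, drop=("voiced", "voiceless")):
    """Merge phonemes whose features agree once voicing is ignored."""
    return tuple(f for f in features[phoneme] if f not in drop)

merged = {}
for ph in features:
    merged.setdefault(merge_key(ph), []).append(ph)
# merged: {('bilabial', 'plosive'): ['p', 'b'], ('alveolar', 'plosive'): ['t', 'd']}

def noisy(seq, rate=0.1, inventory=tuple(features), rng=random.Random(0)):
    """Toy phoneme noise generator: randomly substitute phonemes so the
    phoneme-to-grapheme model sees realistic recognition errors."""
    return [rng.choice(inventory) if rng.random() < rate else p for p in seq]
```

Merging shrinks the phoneme vocabulary (improving coverage in low-resource settings), while the noise generator keeps the second-pass model robust to first-pass errors.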
arXiv Detail & Related papers (2023-12-06T06:37:24Z) - Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z) - Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition [2.8360662552057323]
This study focuses on the efficient incorporation of L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner, to train the acoustic model to align and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech by training solely on L1 speech data.
arXiv Detail & Related papers (2023-06-05T01:55:33Z) - MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z) - Differentiable Allophone Graphs for Language-Universal Speech Recognition [77.2981317283029]
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages.
We present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings.
We build a universal phone-based speech recognition model with interpretable probabilistic phone-to-phoneme mappings for each language.
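The probabilistic phone-to-phoneme mapping can be illustrated as follows: a language-universal phone posterior is marginalized into a language-specific phoneme posterior through per-phone mapping weights. The inventory and weights below are hand-written toy assumptions; in the paper these mappings are learned and interpretable per language.

```python
# P(phoneme | phone) for one hypothetical language: aspirated and plain
# stops collapse to the same phoneme (no phonemic aspiration contrast).
phone_to_phoneme = {
    "p":  {"/p/": 1.0},
    "ph": {"/p/": 1.0},
    "t":  {"/t/": 1.0},
    "th": {"/t/": 1.0},
}

def phoneme_posterior(phone_post):
    """Marginalize a phone posterior into a phoneme posterior:
    P(phoneme) = sum over phones of P(phone) * P(phoneme | phone)."""
    out = {}
    for phone, p in phone_post.items():
        for phoneme, w in phone_to_phoneme[phone].items():
            out[phoneme] = out.get(phoneme, 0.0) + p * w
    return out

post = phoneme_posterior({"p": 0.5, "ph": 0.3, "t": 0.2})
print(post)  # {'/p/': 0.8, '/t/': 0.2}
```

Because the mapping is expressed as probabilities, it can be made differentiable and trained jointly with the universal phone recognizer.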
arXiv Detail & Related papers (2021-07-24T15:09:32Z) - Phoneme Recognition through Fine Tuning of Phonetic Representations: a Case Study on Luhya Language Varieties [77.2347265289855]
We focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation.
To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda.
We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates.
arXiv Detail & Related papers (2021-04-04T15:07:55Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z) - Universal Phone Recognition with a Multilingual Allophone System [135.2254086165086]
We propose a joint model of language-independent phone and language-dependent phoneme distributions.
In multilingual ASR experiments over 11 languages, we find that this model improves testing performance by 2% phoneme error rate absolute.
Our recognizer achieves phone accuracy improvements of more than 17%, moving a step closer to speech recognition for all languages in the world.
arXiv Detail & Related papers (2020-02-26T21:28:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.