VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance
- URL: http://arxiv.org/abs/2512.20032v2
- Date: Mon, 29 Dec 2025 00:34:57 GMT
- Title: VALLR-Pin: Uncertainty-Factorized Visual Speech Recognition for Mandarin with Pinyin Guidance
- Authors: Chang Sun, Dongliang Xie, Wanpeng Xie, Bo Qin, Hong Yang
- Abstract summary: We propose VALLR-Pin, a two-stage Mandarin visual speech recognition framework. VALLR-Pin explicitly incorporates Pinyin as an intermediate representation. We show that VALLR-Pin consistently improves transcription accuracy under multi-speaker conditions.
- Score: 10.289249986948393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual speech recognition (VSR) aims to transcribe spoken content from silent lip-motion videos and is particularly challenging in Mandarin due to severe viseme ambiguity and pervasive homophones. We propose VALLR-Pin, a two-stage Mandarin VSR framework that extends the VALLR architecture by explicitly incorporating Pinyin as an intermediate representation. In the first stage, a shared visual encoder feeds dual decoders that jointly predict Mandarin characters and their corresponding Pinyin sequences, encouraging more robust visual-linguistic representations. In the second stage, an LLM-based refinement module takes the predicted Pinyin sequence together with an N-best list of character hypotheses to resolve homophone-induced ambiguities. To further adapt the LLM to visual recognition errors, we fine-tune it on synthetic instruction data constructed from model-generated Pinyin-text pairs, enabling error-aware correction. Experiments on public Mandarin VSR benchmarks demonstrate that VALLR-Pin consistently improves transcription accuracy under multi-speaker conditions, highlighting the effectiveness of combining phonetic guidance with lightweight LLM refinement.
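To make the stage-1 design concrete, below is a minimal PyTorch sketch of a shared visual encoder feeding dual CTC heads over Mandarin characters and Pinyin syllables. Module choices, vocabulary sizes, and the loss weighting are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: one shared visual encoder, two CTC heads (characters and
# Pinyin), trained with a weighted joint loss. The 0.3 weight is assumed.
import torch
import torch.nn as nn

class DualDecoderVSR(nn.Module):
    def __init__(self, feat_dim=512, char_vocab=4000, pinyin_vocab=1400):
        super().__init__()
        # Stand-in for the lip encoder (in practice a 3D-CNN/Transformer stack).
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.char_head = nn.Linear(feat_dim, char_vocab)      # character branch
        self.pinyin_head = nn.Linear(feat_dim, pinyin_vocab)  # Pinyin branch
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, lip_feats, char_tgt, py_tgt, in_lens, char_lens, py_lens):
        h = self.encoder(lip_feats)                           # (B, T, D)
        char_logp = self.char_head(h).log_softmax(-1)         # (B, T, Vc)
        py_logp = self.pinyin_head(h).log_softmax(-1)         # (B, T, Vp)
        # nn.CTCLoss expects (T, B, V) log-probabilities
        loss_char = self.ctc(char_logp.transpose(0, 1), char_tgt, in_lens, char_lens)
        loss_py = self.ctc(py_logp.transpose(0, 1), py_tgt, in_lens, py_lens)
        return loss_char + 0.3 * loss_py
```

Stage 2 then formats the decoded Pinyin sequence together with the N-best character hypotheses into a correction prompt for the fine-tuned LLM.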
Related papers
- Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models [68.69744941948986]
Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations.
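The nearest-token probe described above can be sketched as below: compare each modality-adapter output vector against the decoder LM's input embedding matrix and read off the most similar tokens. `tokenizer` and `embedding_matrix` are placeholders (e.g., a Hugging Face tokenizer and `model.get_input_embeddings().weight`), not the paper's code.

```python
# Sketch of a nearest-token probe via cosine similarity in embedding space.
import torch
import torch.nn.functional as F

def nearest_tokens(ma_vecs, embedding_matrix, tokenizer, k=3):
    ma = F.normalize(ma_vecs, dim=-1)             # (N, D) adapter outputs
    emb = F.normalize(embedding_matrix, dim=-1)   # (V, D) token embeddings
    sims = ma @ emb.T                             # (N, V) cosine similarities
    top_ids = sims.topk(k, dim=-1).indices        # (N, k) nearest token ids
    return [[tokenizer.decode([int(i)]) for i in row] for row in top_ids]
```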
arXiv Detail & Related papers (2025-10-02T21:19:40Z) - PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs [58.2469845374385]
We introduce Progressive Alignment Representation Training (PART), a multi-stage and multi-task framework that separates within-language from cross-language alignment. Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches.
arXiv Detail & Related papers (2025-09-24T03:54:14Z) - Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation [48.20483623444857]
Sign Language Translation aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation. We propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses.
arXiv Detail & Related papers (2025-05-21T12:19:55Z) - FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [81.3306413498174]
Movie dubbing aims to convert scripts into speech that aligns with the given movie clip in both temporal and emotional aspects. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. We propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber.
arXiv Detail & Related papers (2025-05-02T13:30:19Z) - Disambiguation of Chinese Polyphones in an End-to-End Framework with Semantic Features Extracted by Pre-trained BERT [81.99600765234285]
We propose an end-to-end framework to predict the pronunciation of a polyphonic character. The proposed method consists of a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model and a neural network (NN) based classifier.
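A minimal sketch of this recipe, assuming the common setup of classifying the BERT hidden state at the polyphonic character's position; the checkpoint and classifier head are illustrative, not the paper's exact configuration.

```python
# Encode the sentence with Chinese BERT, take the hidden state at the
# polyphonic character's position, classify among candidate pronunciations.
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

class PolyphoneClassifier(nn.Module):
    def __init__(self, num_pronunciations):
        super().__init__()
        self.head = nn.Linear(bert.config.hidden_size, num_pronunciations)

    def forward(self, sentence, char_index):
        enc = tokenizer(sentence, return_tensors="pt")
        hidden = bert(**enc).last_hidden_state       # (1, T, H)
        # +1 skips [CLS]; bert-base-chinese tokenizes one character per token.
        return self.head(hidden[0, char_index + 1])  # logits over candidates

# e.g. 行 is "hang2" in 银行 but "xing2" in 行走:
clf = PolyphoneClassifier(num_pronunciations=2)
logits = clf("我去银行取钱", char_index=3)  # index of 行
```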
arXiv Detail & Related papers (2025-01-02T06:51:52Z) - Large Language Model Should Understand Pinyin for Chinese ASR Error Correction [31.13523648668466]
We propose Pinyin-enhanced GEC to improve Chinese ASR error correction.
Our approach only utilizes synthetic errors for training and employs the one-best hypothesis during inference.
Experiments on the Aishell-1 and the Common Voice datasets demonstrate that our approach consistently outperforms GEC with text-only input.
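A toy sketch of the synthetic-error idea: corrupt clean text by swapping characters for same-Pinyin alternatives, yielding training pairs for an error-aware corrector. The homophone table, corpus, and corruption rate are all illustrative assumptions; Pinyin comes from the pypinyin package.

```python
# Synthesize homophone errors by sampling same-Pinyin substitutes.
import random
from pypinyin import lazy_pinyin  # pip install pypinyin

HOMOPHONES = {"jing": "京惊经晶", "shi": "是事市世视", "yi": "一以已意义"}

def corrupt(text: str, p: float = 0.3) -> str:
    out = []
    for ch, py in zip(text, lazy_pinyin(text)):  # one Pinyin per character
        cands = HOMOPHONES.get(py, "").replace(ch, "")
        out.append(random.choice(cands) if cands and random.random() < p else ch)
    return "".join(out)

clean = "我去北京看世界"
print(clean, "->", corrupt(clean))  # e.g. 我去北惊看事界
```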
arXiv Detail & Related papers (2024-09-20T06:50:56Z) - Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models [11.287933170894311]
We construct a specialized benchmark dataset aimed at error correction for Chinese ASR with 724K hypothesis-transcription pairs.
We propose a method of Pinyin regularization for prompts, which involves the transcription of Pinyin directly from text hypotheses.
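A minimal sketch of Pinyin regularization for prompts: transcribe the ASR hypothesis to Pinyin with the pypinyin package and attach it to the correction prompt so the LLM can resolve homophone errors. The prompt wording is an assumption, not the paper's template.

```python
# Build a Pinyin-regularized correction prompt from an ASR hypothesis.
from pypinyin import lazy_pinyin, Style  # pip install pypinyin

hypothesis = "我想去北惊"  # ASR output with a homophone error (惊 for 京)
pinyin = " ".join(lazy_pinyin(hypothesis, style=Style.TONE3))  # tone numbers

prompt = (
    "Correct the Chinese ASR hypothesis using its Pinyin.\n"
    f"Hypothesis: {hypothesis}\n"
    f"Pinyin: {pinyin}\n"  # "wo3 xiang3 qu4 bei3 jing1"
    "Correction:"
)
print(prompt)  # the shared Pinyin "bei3 jing1" points the LLM toward 北京
```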
arXiv Detail & Related papers (2024-07-02T03:16:47Z) - L1-aware Multilingual Mispronunciation Detection Framework [10.15106073866792]
This paper introduces a novel multilingual MDD architecture, L1-MultiMDD, enriched with L1-aware speech representation.
An end-to-end speech encoder is trained on the input signal and its corresponding reference phoneme sequence.
Experiments demonstrate the effectiveness of the proposed L1-MultiMDD framework on both seen (L2-ARTIC, LATIC, and AraVoiceL2v2) and unseen (EpaDB and Speechocean762) datasets.
arXiv Detail & Related papers (2023-09-14T13:53:17Z) - Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse [34.70927441846784]
Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos.
We propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner, as a human would.
We report significant improvements on state-of-the-art translation performance using contextual information, nearly doubling the reported BLEU-4 scores of baseline approaches.
arXiv Detail & Related papers (2023-08-18T15:27:22Z) - Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals phonemic transcription provides essential information beyond the orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z) - Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition [2.8360662552057323]
This study addresses the efficient incorporation of L2 phonemes (in this work, Korean phonemes) through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner, to train the acoustic model to align and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech by training solely on L1 speech data.
arXiv Detail & Related papers (2023-06-05T01:55:33Z) - Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the data scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z) - Dual-Decoder Transformer For end-to-end Mandarin Chinese Speech Recognition with Pinyin and Character [15.999657143705045]
Pinyin and characters, respectively the spelling and writing systems of Mandarin Chinese, are mutually reinforcing.
We propose a novel Mandarin Chinese ASR model with dual-decoder Transformer according to the characteristics of pinyin transcripts and character transcripts.
Results on the AISHELL-1 test set show that the proposed Speech-Pinyin-Character-Interaction (SPCI) model achieves a 9.85% character error rate (CER) without a language model.
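For reference, CER here is character-level Levenshtein distance normalized by reference length; a self-contained sketch:

```python
# Character error rate (CER): edit distance between hypothesis and reference
# characters, divided by the reference length.
def cer(ref: str, hyp: str) -> float:
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(cer("我去北京", "我去北惊"))  # 0.25: one substitution over four characters
```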
arXiv Detail & Related papers (2022-01-26T07:59:03Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose two families of tokenizers: 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z) - Non-autoregressive Mandarin-English Code-switching Speech Recognition with Pinyin Mask-CTC and Word Embedding Regularization [61.749126838659315]
Mandarin-English code-switching (CS) is frequently used among East and Southeast Asian people.
Recent successful non-autoregressive (NAR) ASR models remove the need for left-to-right beam decoding in autoregressive (AR) models.
We propose changing the Mandarin output target of the encoder to Pinyin for faster encoder training, and introduce a Pinyin-to-Mandarin decoder to learn contextualized information.
arXiv Detail & Related papers (2021-04-06T03:01:09Z) - Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning [9.13211149475579]
The majority of Chinese characters are monophonic, while a special group of characters, called polyphonic characters, have multiple pronunciations.
As a prerequisite of performing speech-related generative tasks, the correct pronunciation must be identified among several candidates.
We propose a novel semi-supervised learning framework for Mandarin Chinese polyphone disambiguation.
arXiv Detail & Related papers (2021-02-01T03:47:59Z) - g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset [14.323478990713477]
We introduce a new benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation.
We train a simple neural network model on it, and find that it outperforms other preexisting G2P systems.
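For context, the released g2pM package can be queried as below; the call follows the project README and may differ across versions.

```python
# Polyphone-aware grapheme-to-phoneme conversion with the g2pM package.
from g2pM import G2pM  # pip install g2pM

model = G2pM()
sentence = "然而，他红了20年以后，他竟退出了大家的视线。"
print(model(sentence, tone=True, char_split=False))
# e.g. ['ran2', 'er2', '，', 'ta1', 'hong2', 'le5', ...]
```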
arXiv Detail & Related papers (2020-04-07T05:44:58Z)