LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
- URL: http://arxiv.org/abs/2509.25670v1
- Date: Tue, 30 Sep 2025 02:13:55 GMT
- Title: LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
- Authors: Kang Yang, Yifan Liang, Fangkun Liu, Zhenping Xie, Chengshi Zheng,
- Abstract summary: We propose Lexical Tone-Aware Lip-to-Speech (LTA-L2S) for Mandarin.<n>Our model adapts an English pre-trained audio-visual self-supervised learning (SSL) model via a cross-lingual transfer learning strategy.<n>Experiments demonstrate LTA-L2S significantly outperforms existing methods in both speech intelligibility and tonal accuracy.
- Score: 17.450358796576225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip-to-speech (L2S) synthesis for Mandarin is a significant challenge, hindered by complex viseme-to-phoneme mappings and the critical role of lexical tones in intelligibility. To address this issue, we propose Lexical Tone-Aware Lip-to-Speech (LTA-L2S). To tackle viseme-to-phoneme complexity, our model adapts an English pre-trained audio-visual self-supervised learning (SSL) model via a cross-lingual transfer learning strategy. This strategy not only transfers universal knowledge learned from extensive English data to the Mandarin domain but also circumvents the prohibitive cost of training such a model from scratch. To specifically model lexical tones and enhance intelligibility, we further employ a flow-matching model to generate the F0 contour. This generation process is guided by ASR-fine-tuned SSL speech units, which contain crucial suprasegmental information. The overall speech quality is then elevated through a two-stage training paradigm, where a flow-matching postnet refines the coarse spectrogram from the first stage. Extensive experiments demonstrate that LTA-L2S significantly outperforms existing methods in both speech intelligibility and tonal accuracy.
Related papers
- Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment [84.39962912136525]
We develop a model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA)<n>Our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA.
arXiv Detail & Related papers (2025-12-08T21:05:46Z) - Towards Unsupervised Speech Recognition at the Syllable-Level [95.54031547995874]
We introduce a syllable-level UASR framework based on masked language modeling.<n>We generalize effectively to Mandarin, a language that has remained particularly difficult for prior methods.
arXiv Detail & Related papers (2025-10-04T02:56:33Z) - Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning [8.717610965852037]
Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech.<n>This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass.
arXiv Detail & Related papers (2025-09-19T14:33:05Z) - ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody.<n>We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z) - CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation [23.610002725335313]
Large Language Models (MLLMs) demonstrate strong generalization across languages, yet they remain prone to hallucinations, especially in low-resource languages.<n>We propose CCL-XCoT, a two-stage fine-tuning framework for mitigating hallucination in MLLMs.<n> Experimental results show that CCL-XCoT reduces hallucination rates by up to 62% and substantially improves factual knowledge transfer across language pairs.
arXiv Detail & Related papers (2025-07-17T14:25:24Z) - A dual task learning approach to fine-tune a multilingual semantic speech encoder for Spoken Language Understanding [12.887586659035497]
Self-Supervised Learning is vastly used to efficiently represent speech for Spoken Language Understanding.
textual SSL models are proposed to encode language-agnostic semantics.
SAMU-XLSR framework employed this semantic information to enrich multilingual speech representations.
arXiv Detail & Related papers (2024-06-17T23:07:53Z) - Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z) - Incorporating L2 Phonemes Using Articulatory Features for Robust Speech
Recognition [2.8360662552057323]
This study is on the efficient incorporation of the L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis.
We employ the lattice-free maximum mutual information (LF-MMI) objective in an end-to-end manner, to train the acoustic model to align and predict one of multiple pronunciation candidates.
Experimental results show that the proposed method improves ASR accuracy for Korean L2 speech by training solely on L1 speech data.
arXiv Detail & Related papers (2023-06-05T01:55:33Z) - The Interpreter Understands Your Meaning: End-to-end Spoken Language
Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models reach higher performance over baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z) - Lip-to-Speech Synthesis in the Wild with Multi-task Learning [32.65865343643458]
We develop a powerful Lip2Speech method that can reconstruct speech with correct contents from the input lip movements, even in a wild environment.
We design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of acoustic feature reconstruction loss.
arXiv Detail & Related papers (2023-02-17T12:31:26Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.