A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages
- URL: http://arxiv.org/abs/2510.06612v1
- Date: Wed, 08 Oct 2025 03:46:39 GMT
- Title: A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages
- Authors: Zibo Su, Kun Wei, Jiahua Li, Xu Yang, Cheng Deng
- Abstract summary: Speech-driven talking face synthesis (TFS) focuses on generating facial animations from audio input. Current models perform well in English but unsatisfactorily in non-English languages, producing wrong mouth shapes and rigid facial expressions. We propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts architecture.
- Score: 60.81571443992153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but unsatisfactorily in non-English languages, producing wrong mouth shapes and rigid facial expressions. This poor performance stems from English-dominated training datasets and a lack of cross-language generalization ability. Thus, we propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture that employs phonemes and visemes as universal intermediaries to bridge audio and video modalities, achieving lifelike multilingual TFS. To alleviate the influence of linguistic differences and dataset bias, we extract audio and video features as phonemes and visemes respectively, which are the basic units of speech sounds and mouth movements. To address audiovisual synchronization issues, we introduce the Phoneme-Viseme Alignment Mechanism (PV-Align), which establishes robust cross-modal correspondences between phonemes and visemes. In addition, we build a Multilingual Talking Face Benchmark (MTFB) comprising 12 diverse languages with 95.04 hours of high-quality videos for training and evaluating multilingual TFS performance. Extensive experiments demonstrate that MuEx achieves superior performance across all languages in MTFB and exhibits effective zero-shot generalization to unseen languages without additional training.
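The two mechanisms named in the abstract can be illustrated with a toy sketch. The paper's actual architecture is not specified in this listing, so the dimensions, router, and cosine-similarity matching below are illustrative assumptions, not the authors' implementation: `pv_align` matches each viseme frame to its most similar phoneme (a stand-in for cross-modal correspondence), and `pg_moe` lets phoneme features gate a small mixture of experts.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8          # shared embedding dimension (illustrative)
NUM_EXPERTS = 4  # expert count in the gated mixture (illustrative)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pv_align(phonemes, visemes):
    """Match each viseme frame to its most similar phoneme.

    Toy stand-in for PV-Align: cosine similarity between the two
    streams, then an argmax per viseme frame.
    """
    p = phonemes / np.linalg.norm(phonemes, axis=-1, keepdims=True)
    v = visemes / np.linalg.norm(visemes, axis=-1, keepdims=True)
    sim = v @ p.T                 # (T_v, T_p) similarity matrix
    return sim.argmax(axis=1)     # best phoneme index per viseme

def pg_moe(features, phoneme_feat, router_w, experts):
    """Phoneme-guided mixture-of-experts: phonemes choose the gates."""
    gates = softmax(phoneme_feat @ router_w)               # (T, E)
    outs = np.stack([features @ w for w in experts], -1)   # (T, DIM, E)
    return (outs * gates[:, None, :]).sum(-1)              # (T, DIM)

# Toy usage with random features standing in for encoder outputs.
phonemes = rng.standard_normal((10, DIM))
visemes = rng.standard_normal((6, DIM))
router_w = rng.standard_normal((DIM, NUM_EXPERTS))
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]

alignment = pv_align(phonemes, visemes)                 # (6,) indices
fused = pg_moe(visemes, phonemes[alignment], router_w, experts)
print(alignment.shape, fused.shape)
```

Because both phonemes and visemes live in a shared embedding space here, the same routine applies regardless of source language, which is the intuition behind using them as universal intermediaries.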
Related papers
- TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation [76.48551690189406]
We present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation.
arXiv Detail & Related papers (2025-10-08T17:16:09Z)
- PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs [58.2469845374385]
We introduce Progressive Alignment Representation Training (PART), a multi-stage and multi-task framework that separates within-language from cross-language alignment. Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches.
arXiv Detail & Related papers (2025-09-24T03:54:14Z)
- Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation [18.89091877062589]
LanStyleTTS is a non-autoregressive, language-aware style adaptive TTS framework. It supports a unified multilingual TTS model capable of producing accurate and high-quality speech without the need to train language-specific models.
arXiv Detail & Related papers (2025-04-11T06:12:57Z)
- VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization [20.728919218746363]
VQTalker is a Vector Quantization-based framework for multilingual talking head generation. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units. VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios.
arXiv Detail & Related papers (2024-12-13T06:14:57Z)
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results against state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition [51.412413996510814]
We propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks.
MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2.
arXiv Detail & Related papers (2023-03-09T14:58:29Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.