SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
- URL: http://arxiv.org/abs/2512.05126v1
- Date: Sun, 23 Nov 2025 16:51:05 GMT
- Title: SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
- Authors: Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong
- Abstract summary: Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization. We propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model.
- Score: 34.874153953305346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.
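The abstract credits its cross-lingual robustness to a Dual Speaker Encoder but does not describe the architecture here. As a hedged illustration only, the sketch below shows one plausible reading: two parallel speaker encoders whose utterance embeddings are fused into a single conditioning vector for the TTS backbone. Every class name, layer choice, and dimension is an assumption, not the paper's design.

```python
# Hypothetical sketch of a "Dual Speaker Encoder": two parallel speaker
# encoders whose embeddings are fused before conditioning the TTS decoder.
# All module names and dimensions are illustrative, not from the paper.
import torch
import torch.nn as nn

class DualSpeakerEncoder(nn.Module):
    def __init__(self, mel_dim: int = 80, emb_dim: int = 256):
        super().__init__()
        # One encoder per role, e.g. source-language timbre vs.
        # target-language prosody, to limit cross-lingual interference.
        self.source_enc = nn.GRU(mel_dim, emb_dim, batch_first=True)
        self.target_enc = nn.GRU(mel_dim, emb_dim, batch_first=True)
        self.fuse = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, src_mels, tgt_mels):
        # Use the final hidden state of each GRU as an utterance embedding.
        _, h_src = self.source_enc(src_mels)
        _, h_tgt = self.target_enc(tgt_mels)
        spk = torch.cat([h_src[-1], h_tgt[-1]], dim=-1)
        return self.fuse(spk)  # conditioning vector for the TTS backbone

enc = DualSpeakerEncoder()
spk_emb = enc(torch.randn(2, 120, 80), torch.randn(2, 120, 80))
print(spk_emb.shape)  # torch.Size([2, 256])
```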
Related papers
- JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion [47.70095297438178]
We introduce a single-model approach that adapts an audio-video diffusion model for video-to-video dubbing via a lightweight LoRA. We generate multilingual videos with language switches within a single clip, and then inpaint the face and audio in each half to match the language of the other half. We demonstrate that our approach produces high-quality dubbed videos with improved visual fidelity, lip synchronization, and robustness compared to existing dubbing pipelines.
arXiv Detail & Related papers (2026-01-29T18:57:13Z)
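The summary says the diffusion model is adapted "via a lightweight LoRA" without specifying where the adapter attaches. For orientation, here is the standard low-rank-adapter pattern such a setup would typically follow; the injection point, rank, and scaling are assumptions rather than details from the paper.

```python
# Hedged sketch of a LoRA adapter of the kind the summary describes:
# a low-rank update added to a frozen linear projection inside an
# audio-video diffusion transformer. Names and ranks are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# e.g. wrap the query projection of one attention block
q_proj = LoRALinear(nn.Linear(512, 512))
out = q_proj(torch.randn(4, 512))
```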
- FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes [56.534404169212785]
FunCineForge comprises an end-to-end production pipeline for large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. We construct the first Chinese television dubbing dataset with rich annotations and demonstrate the high quality of these data. Experiments across monologue, narration, dialogue, and multi-speaker scenes show that our dubbing model consistently outperforms SOTA methods in audio quality, lip sync, timbre transfer, and instruction following.
arXiv Detail & Related papers (2026-01-21T08:57:00Z)
- Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis [57.5830191022097]
A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. We adopt two-stage training: pretraining on Wav2Vec2 embeddings and fine-tuning on TTS outputs. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines.
arXiv Detail & Related papers (2025-11-07T17:07:56Z)
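The Text-to-Vec idea (text in, Wav2Vec2-style latents out, shared by speech and face generation) can be pictured with a small sketch. This is a hedged toy version: the token-to-frame duration/upsampling step is omitted, and all module names and dimensions are assumptions.

```python
# Hedged sketch of a "Text-to-Vec" module: map text (phoneme) tokens to
# vectors in a Wav2Vec2-like embedding space, which can then condition
# both a vocoder and a face generator. Duration modeling that would
# upsample tokens to frame rate is omitted for brevity.
import torch
import torch.nn as nn

class TextToVec(nn.Module):
    def __init__(self, vocab: int = 100, dim: int = 256, w2v_dim: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.proj = nn.Linear(dim, w2v_dim)  # into the Wav2Vec2 space

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.proj(h)  # (batch, tokens, w2v_dim) shared conditioning

t2v = TextToVec()
latents = t2v(torch.randint(0, 100, (2, 50)))
# `latents` would condition both the speech and the face decoder.
```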
- VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models [43.1613638989795]
We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals.
arXiv Detail & Related papers (2025-04-03T08:24:47Z) - SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer [68.78023656892319]
This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech. SyncSpeech has the following advantages: low latency, as it begins generating streaming speech upon receiving the second text token; and high efficiency, as it decodes all speech tokens corresponding to each newly arrived text token in one step.
arXiv Detail & Related papers (2025-02-16T12:14:17Z)
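The low-latency claim has a concrete control flow behind it: generation starts at the second text token, and each arriving token is expanded into its speech tokens in a single decoding step. The toy loop below mimics only that scheduling; the decoder is a stub and every name is hypothetical, not the paper's temporal masked transformer.

```python
# Hedged sketch of the dual-stream scheduling the summary describes.
from typing import Iterable, List

def one_step_decode(text_token: str, context: List[str]) -> List[int]:
    # Stub: a real model would emit a variable-length block of speech
    # tokens conditioned on this text token and the running context.
    return [hash((text_token, len(context))) % 1000]

def stream_tts(text_stream: Iterable[str]) -> List[int]:
    context: List[str] = []
    speech_tokens: List[int] = []
    for token in text_stream:
        context.append(token)
        if len(context) < 2:
            continue  # wait for the second text token before starting
        # one decoding step emits all speech tokens for the previous
        # text token, keeping one token of lookahead across boundaries
        speech_tokens += one_step_decode(context[-2], context)
    if context:
        # flush the final text token once the stream ends
        speech_tokens += one_step_decode(context[-1], context)
    return speech_tokens

print(stream_tts(["hel", "lo", " wor", "ld"]))
```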
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and the isochrony of the source speech during translation.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
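"Isochrony" here means the translated speech keeps the source utterance's timing. TransVIP uses a dedicated encoder for this; as a generic, hedged illustration of the constraint itself, one can simply rescale predicted target-side durations to match the source span. This is a common baseline technique, not TransVIP's actual mechanism.

```python
# Hedged sketch of one common way to enforce isochrony: rescale predicted
# target-side phone durations so translated speech occupies the same time
# span as the source utterance.
from typing import List

def rescale_durations(pred_durations: List[float],
                      source_duration: float) -> List[float]:
    """Scale per-phone durations (seconds) to sum to source_duration."""
    total = sum(pred_durations)
    if total <= 0:
        raise ValueError("predicted durations must be positive")
    scale = source_duration / total
    return [d * scale for d in pred_durations]

# e.g. a 2.0 s source clip whose translation was predicted at 2.5 s
print(rescale_durations([0.5, 1.2, 0.8], source_duration=2.0))
# ~ [0.4, 0.96, 0.64], which sums to 2.0
```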
- Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model and a training scheme that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z)
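As background for what a synchronization model predicts, the hedged sketch below brute-forces the audio-visual offset by scoring temporal shifts with cosine similarity. Synchformer replaces this with learned transformer scoring over sparse cues, so treat the code as a task illustration only; all names are hypothetical.

```python
# Toy audio-visual offset search: given per-frame audio and visual
# embeddings, score candidate temporal offsets by average cosine
# similarity and pick the best one.
import numpy as np

def best_offset(audio: np.ndarray, video: np.ndarray, max_shift: int) -> int:
    """audio, video: (frames, dim) L2-normalized embeddings at equal rates."""
    scores = {}
    for shift in range(-max_shift, max_shift + 1):
        a = audio[max(0, shift):len(audio) + min(0, shift)]
        v = video[max(0, -shift):len(video) + min(0, -shift)]
        n = min(len(a), len(v))
        scores[shift] = float((a[:n] * v[:n]).sum(axis=1).mean())
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
v = rng.standard_normal((100, 16))
v /= np.linalg.norm(v, axis=1, keepdims=True)
a = np.roll(v, 3, axis=0)  # audio lags video by 3 frames
print(best_offset(a, v, max_shift=5))  # expected: 3
```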
- More than Words: In-the-Wild Visually-Driven Prosody for Text-to-Speech [9.035846000646481]
Motivated by dubbing, VDTTS takes advantage of video frames as an additional input alongside text.
We demonstrate how this allows VDTTS to generate speech that not only exhibits prosodic variation, such as natural pauses and pitch changes, but is also synchronized to the input video.
arXiv Detail & Related papers (2021-11-19T10:23:38Z)
- Neural Dubber: Dubbing for Silent Videos According to Scripts [22.814626504851752]
We propose Neural Dubber, the first neural network model to solve a novel automatic video dubbing (AVD) task.
Neural Dubber is a multi-modal text-to-speech model that utilizes lip movements in the video to control the prosody of the generated speech.
Experiments show that Neural Dubber can control the prosody of synthesized speech by the video, and generate high-fidelity speech temporally synchronized with the video.
arXiv Detail & Related papers (2021-10-15T17:56:07Z)
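Where the summary says lip movements "control the prosody" of the generated speech, a minimal way to picture that coupling is a per-frame predictor from lip features to pitch and energy targets. The sketch below is that generic idea only; names and dimensions are assumptions, not Neural Dubber's published architecture.

```python
# Hedged sketch of "lip movement controls prosody": a small predictor
# maps per-frame lip-region features to pitch and energy values that a
# TTS variance adaptor could consume.
import torch
import torch.nn as nn

class LipProsodyPredictor(nn.Module):
    def __init__(self, lip_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(lip_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # per-frame (pitch, energy)
        )

    def forward(self, lip_feats):  # (batch, frames, lip_dim)
        return self.net(lip_feats)

pred = LipProsodyPredictor()
prosody = pred(torch.randn(2, 75, 128))  # 75 video frames -> 75 targets
print(prosody.shape)  # torch.Size([2, 75, 2])
```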