Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis
- URL: http://arxiv.org/abs/2511.05432v1
- Date: Fri, 07 Nov 2025 17:07:56 GMT
- Title: Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis
- Authors: Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, Alexander Waibel
- Abstract summary: A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. We adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines.
- Score: 57.5830191022097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
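To make the two-stage regimen above concrete, here is a minimal PyTorch sketch, not the authors' implementation: all module names, dimensions, and the dummy tensors are assumptions. Stage one trains a face generator on ground-truth Wav2Vec2 features; stage two reruns the same loop on Text-to-Vec (TTS-predicted) features so the generator adapts to the shifted distribution it will see at inference.

```python
import torch
import torch.nn as nn

W2V_DIM = 768  # Wav2Vec2-base feature size (assumption)

class FaceGenerator(nn.Module):
    """Hypothetical stand-in: maps per-frame speech latents to face motion."""
    def __init__(self, latent_dim=W2V_DIM, motion_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.GELU(), nn.Linear(256, motion_dim)
        )

    def forward(self, speech_latents):       # (B, T, latent_dim)
        return self.net(speech_latents)      # (B, T, motion_dim)

def train_stage(generator, latent_batches, target_batches, lr=1e-4):
    """One pass over (speech-latent, face-motion) pairs; same loop for both stages."""
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for latents, targets in zip(latent_batches, target_batches):
        opt.zero_grad()
        loss = loss_fn(generator(latents), targets)
        loss.backward()
        opt.step()

gen = FaceGenerator()
# Stage 1: ground-truth Wav2Vec2 features extracted from real audio (dummy data here).
clean_latents = [torch.randn(4, 100, W2V_DIM)]
face_targets = [torch.randn(4, 100, 64)]
train_stage(gen, clean_latents, face_targets)
# Stage 2: identical loop, but latents now come from the Text-to-Vec module,
# exposing the generator to TTS-predicted features before inference.
tts_latents = [torch.randn(4, 100, W2V_DIM)]  # placeholder for Text-to-Vec output
train_stage(gen, tts_latents, face_targets, lr=1e-5)
```

The point of the second stage is purely distributional: the loss and architecture are unchanged, and only the source of the conditioning latents differs.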
Related papers
- SpeechOp: Inference-Time Task Composition for Generative Speech Processing [41.5053493629172]
SpeechOp is a universal speech processor capable of performing a wide range of speech tasks.
Implicit Task Composition improves SpeechOp's enhancement capability through principled inference-time task composition.
arXiv Detail & Related papers (2025-09-17T05:05:55Z)
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge.
We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
- Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis [52.25128289155576]
This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), where the voice can be generated from a face image.
We aim to mitigate three key challenges in face-driven TTS systems.
Experimental results validate the proposed model's effectiveness in face-driven voice synthesis; a generic conditioning sketch follows this entry.
arXiv Detail & Related papers (2025-05-25T04:43:17Z)
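As referenced in the entry above, the following is a minimal sketch of face-driven speaker conditioning under stated assumptions: the encoder architecture, embedding size, and input resolution are all hypothetical, since the abstract only states that the voice can be generated from a face image.

```python
import torch
import torch.nn as nn

class FaceToSpeakerEmbedding(nn.Module):
    """Hypothetical face encoder: a face image -> a fixed-size speaker embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, face):            # (B, 3, H, W)
        h = self.conv(face).flatten(1)  # (B, 64)
        return self.proj(h)             # (B, embed_dim)

face = torch.randn(2, 3, 112, 112)
spk = FaceToSpeakerEmbedding()(face)  # would condition the TTS decoder in place
print(spk.shape)                      # of an audio-derived speaker vector
```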
- GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM [42.93855899824886]
We propose a text-to-speech generation approach optimized via a novel dual-branch ArchiTecture (GOAT-TTS).
GOAT-TTS combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency.
Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models.
arXiv Detail & Related papers (2025-04-15T01:44:56Z)
- SupertonicTTS: Towards Highly Efficient and Streamlined Text-to-Speech System [10.156665325783583]
SupertonicTTS is a novel text-to-speech (TTS) system designed for efficient and streamlined speech synthesis.
We employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks to enable a lightweight architecture; a generic ConvNeXt-1D block is sketched after this entry.
Experimental results demonstrate that SupertonicTTS delivers performance comparable to contemporary zero-shot TTS models with only 44M parameters.
arXiv Detail & Related papers (2025-03-29T14:59:32Z)
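The summary above mentions ConvNeXt blocks and temporal compression of latents; the sketch below shows a generic 1-D ConvNeXt block of the kind used for sequence modeling, with an assumed strided convolution for compression. It illustrates the named building block, not SupertonicTTS's actual architecture.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock1d(nn.Module):
    """Generic 1-D ConvNeXt block: depthwise conv -> LayerNorm -> MLP, plus residual."""
    def __init__(self, dim, expansion=4, kernel_size=7):
        super().__init__()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x):          # x: (B, C, T)
        residual = x
        x = self.dwconv(x)
        x = x.transpose(1, 2)      # (B, T, C) for LayerNorm and the linear layers
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        return residual + x.transpose(1, 2)

# Temporal compression of latents could be a strided conv in front (assumption):
compress = nn.Conv1d(64, 64, kernel_size=4, stride=4)
x = torch.randn(1, 64, 400)
y = ConvNeXtBlock1d(64)(compress(x))  # (1, 64, 100): 4x shorter sequence
```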
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages; a minimal staged-interface sketch follows this entry.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
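As a rough illustration of the two-stage split referenced above, the sketch below separates a semantic seq2seq stage from a fine-grained acoustic stage. The GRU/embedding stand-ins, vocabulary sizes, and greedy token selection are all assumptions; the paper's actual semantic stage is a neural transducer.

```python
import torch
import torch.nn as nn

class SemanticStage(nn.Module):
    """Hypothetical stand-in for the transducer: text ids -> semantic token ids."""
    def __init__(self, vocab=256, n_sem_tokens=512, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_sem_tokens)

    def forward(self, text_ids):                # (B, T_text)
        h, _ = self.rnn(self.embed(text_ids))
        return self.head(h).argmax(-1)          # (B, T_text) semantic token ids

class AcousticStage(nn.Module):
    """Hypothetical fine-grained stage: semantic tokens -> acoustic features."""
    def __init__(self, n_sem_tokens=512, mel_dim=80, dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_sem_tokens, dim)
        self.out = nn.Linear(dim, mel_dim)

    def forward(self, sem_tokens):              # (B, T)
        return self.out(self.embed(sem_tokens)) # (B, T, mel_dim)

text = torch.randint(0, 256, (1, 20))
mel = AcousticStage()(SemanticStage()(text))    # staged text -> tokens -> mel
```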
- Cross-Utterance Conditioned VAE for Speech Generation [27.5887600344053]
We present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation.
We propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing.
arXiv Detail & Related papers (2023-09-08T06:48:41Z)
- DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech [30.110058338155675]
Cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain speaker timbres.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis; a minimal dual-embedding sketch follows this entry.
arXiv Detail & Related papers (2023-06-25T06:46:36Z)
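Following the forward reference above, this is a minimal sketch of the dual-speaker-embedding idea: two separate speaker tables, one conditioning the acoustic model and one the vocoder. Table sizes, dimensions, and additive fusion are assumptions, not DSE-TTS specifics.

```python
import torch
import torch.nn as nn

# One speaker vector conditions the acoustic model, a second conditions the
# vocoder stage; combining both is the "dual embedding" idea summarized above.
spk_acoustic = nn.Embedding(100, 192)  # speaker table for the acoustic model
spk_vocoder = nn.Embedding(100, 192)   # separate table for the vocoder

speaker_id = torch.tensor([7])
e_a, e_v = spk_acoustic(speaker_id), spk_vocoder(speaker_id)

hidden = torch.randn(1, 50, 192)            # acoustic-model hidden states
hidden = hidden + e_a.unsqueeze(1)          # condition acoustics on e_a
vocoder_in = hidden + e_v.unsqueeze(1)      # condition the vocoder on e_v
```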
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique that repeatedly masks and predicts unit choices; a generic mask-predict loop is sketched after this entry.
TranSpeech significantly improves inference latency, achieving up to a 21.4x speedup over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
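The mask-predict loop referenced above can be sketched generically: start from a fully masked unit sequence, predict all positions, then re-mask the least confident ones and iterate. The `model` interface, mask id, and linear unmasking schedule are assumptions rather than TranSpeech's exact procedure.

```python
import torch

def mask_predict(model, length, mask_id, iterations=4):
    """Generic non-autoregressive mask-predict decoding.
    `model(tokens) -> (T, V) logits` is a hypothetical interface."""
    tokens = torch.full((length,), mask_id)
    for it in range(iterations):
        probs = model(tokens).softmax(-1)   # (T, V)
        conf, pred = probs.max(-1)          # per-position confidence + prediction
        tokens = pred
        # Linearly shrink the number of re-masked positions each iteration.
        n_mask = int(length * (1 - (it + 1) / iterations))
        if n_mask > 0:
            lowest = conf.topk(n_mask, largest=False).indices
            tokens[lowest] = mask_id        # re-mask the least confident units
    return tokens

# Toy model: random logits over a 10-symbol unit vocabulary (id 9 = mask).
toy = lambda t: torch.randn(t.numel(), 10)
units = mask_predict(toy, length=12, mask_id=9)
```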
- VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization; a minimal visual-conditioning sketch follows this entry.
arXiv Detail & Related papers (2021-10-07T11:25:25Z)
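As referenced in the VisualTTS entry above, one generic way to condition speech rendering on the temporal progression of a lip sequence is cross-attention from phoneme states to lip-frame features. The encoders, dimensions, and single-attention design below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Phoneme states attend over lip-frame features so that speech timing can
# follow the video; the fused states would drive an acoustic decoder.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

phoneme_states = torch.randn(1, 40, 256)  # (B, T_phoneme, D) from a text encoder
lip_features = torch.randn(1, 120, 256)   # (B, T_video, D) from a lip encoder

fused, _ = attn(query=phoneme_states, key=lip_features, value=lip_features)
print(fused.shape)  # (1, 40, 256): one visually informed state per phoneme
```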