Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2505.18972v1
- Date: Sun, 25 May 2025 04:43:17 GMT
- Title: Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis
- Authors: Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic
- Abstract summary: This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), where the voice can be generated from a face image. We aim to mitigate three challenges in face-driven TTS systems. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
- Score: 52.25128289155576
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), where the voice can be generated from a face image and the characteristics of the output speech (e.g., pace, noise level, distance, tone, place) can be controlled with a natural text description. Specifically, we aim to mitigate the following three challenges in face-driven TTS systems. 1) To overcome the limited audio quality of audio-visual speech corpora, we propose a training method that additionally utilizes high-quality audio-only speech corpora. 2) To generate voices not only from real human faces but also from artistic portraits, we propose augmenting the input face image with stylization. 3) To account for the one-to-many possibilities in face-to-voice mapping while ensuring consistent voice generation, we propose to first employ sampling-based decoding and then use prompting with the generated speech samples. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
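A minimal sketch of the third idea (sample, then prompt for consistency) is given below. The stub functions (face_to_voice_sample, speaker_embedding, tts_with_prompt) are hypothetical stand-ins, not the paper's implementation, and the selection rule (candidate closest to the embedding centroid) is likewise an assumption for illustration.

```python
import numpy as np

# Hypothetical components, not the paper's implementation: a
# sampling-based face-to-voice decoder, a prompt-conditioned TTS,
# and a speaker-embedding extractor, all stubbed with noise.
def face_to_voice_sample(face_image, text, rng):
    """Draw one speech sample via sampling-based decoding (stub)."""
    return rng.standard_normal(16000)  # one second of fake waveform

def speaker_embedding(waveform):
    """Extract a unit-norm speaker embedding (stub: spectral shape)."""
    spectrum = np.abs(np.fft.rfft(waveform))[:256]
    return spectrum / (np.linalg.norm(spectrum) + 1e-8)

def tts_with_prompt(text, prompt_waveform, rng):
    """Generate speech in the voice of the prompt sample (stub)."""
    return rng.standard_normal(16000)

def synthesize_consistently(face_image, sentences, n_candidates=5, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Sampling-based decoding: draw several candidate voices for
    #    the face, reflecting the one-to-many face-to-voice mapping.
    candidates = [face_to_voice_sample(face_image, sentences[0], rng)
                  for _ in range(n_candidates)]
    # 2) Choose an anchor sample; here, the candidate whose speaker
    #    embedding lies closest to the centroid (an assumed rule).
    embs = np.stack([speaker_embedding(c) for c in candidates])
    anchor = candidates[int(np.argmax(embs @ embs.mean(axis=0)))]
    # 3) Prompt the TTS with the anchor so every sentence comes out
    #    in the same voice.
    return [tts_with_prompt(s, anchor, rng) for s in sentences]
```

One plausible reading of this design: sampling preserves the one-to-many nature of face-to-voice mapping, while prompting with a single chosen sample pins down one voice for all subsequent sentences.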
Related papers
- Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation [14.036076647627553]
Given a face image and text to speak, we generate a talking-face animation and the corresponding speech. We propose a novel framework, Face2VoiceSync, with several novel contributions. Experiments show Face2VoiceSync achieves state-of-the-art visual and audio performance on a single 40GB GPU.
arXiv Detail & Related papers (2025-07-25T12:49:06Z)
- Faces that Speak: Jointly Synthesising Talking Face and Speech from Text [22.87082439322244]
We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework.
We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity.
Our experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text.
arXiv Detail & Related papers (2024-05-16T17:29:37Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
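SpeechX's versatility comes from conditioning a single codec language model on the task at hand; the snippet below is only a schematic of task-token prompting, with made-up tokens and no real model, not SpeechX's actual interface.

```python
from typing import List

# Made-up task tokens for illustration; SpeechX's real token
# vocabulary and interface are not reproduced here.
TASK_TOKENS = {
    "zero_shot_tts": "<tts>",
    "noise_suppression": "<denoise>",
    "target_speaker_extraction": "<tse>",
    "speech_editing": "<edit>",
}

def build_prompt(task: str, text_tokens: List[str],
                 acoustic_tokens: List[str]) -> List[str]:
    """Prepend a task token so one codec LM can route between tasks."""
    return [TASK_TOKENS[task]] + text_tokens + acoustic_tokens

# The same (hypothetical) model would consume different prompts per
# task and emit neural codec tokens that a vocoder turns into audio.
print(build_prompt("noise_suppression", [], ["<a17>", "<a342>"]))
# ['<denoise>', '<a17>', '<a342>']
```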
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
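Fusing phoneme features with listener visual signals might look something like the cross-attention sketch below; the module design, names, and shapes are assumptions for illustration, not the VA-TTS baseline itself.

```python
import torch
import torch.nn as nn

class PhonemeVisualFusion(nn.Module):
    """Illustrative fusion: phoneme features attend to listener
    visual features via cross-attention (an assumed design)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, phoneme_feats, visual_feats):
        # phoneme_feats: (B, T_text, dim); visual_feats: (B, T_vis, dim)
        fused, _ = self.attn(query=phoneme_feats,
                             key=visual_feats, value=visual_feats)
        return self.norm(phoneme_feats + fused)  # residual fusion

# Toy usage with random features in place of real encoders.
fusion = PhonemeVisualFusion()
out = fusion(torch.randn(2, 50, 256), torch.randn(2, 120, 256))
print(out.shape)  # torch.Size([2, 50, 256])
```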
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not yet been achieved technically.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which is built around a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Zero-shot personalized lip-to-speech synthesis with face image based voice control [41.17483247506426]
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
arXiv Detail & Related papers (2023-05-09T02:37:29Z)