Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
- URL: http://arxiv.org/abs/2306.16003v2
- Date: Thu, 18 Jan 2024 08:31:46 GMT
- Title: Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
- Authors: Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
- Abstract summary: We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
- Score: 64.14812728562596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. Consequently, we can easily generate face videos that articulate the provided textual sentences, eliminating the need to record speech for each inference, as the audio-driven model requires. To this end, we propose to embed the input text into the learned audio latent space of the pre-trained audio-driven model, while preserving the face synthesis capability of the original pre-trained model. Specifically, we devise a Text-to-Audio Embedding Module (TAEM) which maps a given text input into the audio latent space by modeling pronunciation and duration characteristics. Furthermore, to consider the speaker characteristics in audio while using text inputs, TAEM is designed to accept a visual speaker embedding. The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio. The main advantages of the proposed framework are that 1) it can be applied to diverse audio-driven talking face synthesis models, and 2) it can flexibly generate talking face videos from either text or audio inputs.
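To make the abstract's pipeline concrete, the following is a minimal sketch of how a text-to-audio-embedding module of this kind could be wired up in PyTorch: phoneme tokens are encoded, expanded by predicted durations, conditioned on a visual speaker embedding, and projected into the dimension of the frozen audio latent space. All module names, dimensions, and the interface to the pretrained face model are hypothetical illustrations, not the authors' released implementation.

```python
# Minimal sketch of the TAEM idea, assuming a PyTorch-style pipeline.
# Names, dimensions, and the pretrained-model interface are hypothetical.
import torch
import torch.nn as nn


class TextToAudioEmbedding(nn.Module):
    """Maps phoneme tokens into a (frozen) audio latent space, conditioned on
    a visual speaker embedding derived from a single target face image."""

    def __init__(self, n_phonemes=80, d_model=256, d_audio_latent=512, d_speaker=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=3,
        )
        # Predicts how many audio frames each phoneme should span.
        self.duration_predictor = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        # Injects speaker identity so the mapping lands in the right region
        # of the audio latent space for that speaker.
        self.speaker_proj = nn.Linear(d_speaker, d_model)
        self.to_audio_latent = nn.Linear(d_model, d_audio_latent)

    def forward(self, phoneme_ids, speaker_emb):
        x = self.phoneme_emb(phoneme_ids)                    # (B, T_text, d_model)
        x = x + self.speaker_proj(speaker_emb).unsqueeze(1)  # broadcast speaker info
        x = self.encoder(x)
        # Round predicted durations and repeat each phoneme representation
        # so the sequence length matches the audio frame rate.
        dur = self.duration_predictor(x).squeeze(-1).exp().round().clamp(min=1).long()
        frames = [xi.repeat_interleave(di, dim=0) for xi, di in zip(x, dur)]
        frames = nn.utils.rnn.pad_sequence(frames, batch_first=True)
        return self.to_audio_latent(frames)                  # (B, T_audio, d_audio_latent)


# Usage: the output stands in for the audio encoder features of a pretrained,
# frozen audio-driven talking face model (that interface is hypothetical here).
taem = TextToAudioEmbedding()
phonemes = torch.randint(0, 80, (1, 12))   # one tokenized sentence
speaker = torch.randn(1, 256)              # visual speaker embedding from a face image
audio_like_features = taem(phonemes, speaker)
# face_video = frozen_talking_face_model.decode(audio_like_features, face_image)
```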
Related papers
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech
Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking face generation focuses mainly on lip-syncing conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking face videos with human-like articulation and well-synced audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z) - DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided
Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on these speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained
Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model; a minimal sketch of this conditioning scheme appears after this list.
arXiv Detail & Related papers (2023-06-16T05:42:01Z) - Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z) - VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech
Synthesis and Editing [31.666920933058144]
We propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training.
Experiments show A$^3$T outperforms SOTA models on speech editing and improves multi-speaker speech synthesis without an external speaker verification model.
arXiv Detail & Related papers (2022-03-18T01:36:25Z) - Audiovisual Speech Synthesis using Tacotron2 [14.206988023567828]
We propose and compare two audiovisual speech synthesis systems for 3D face models.
AVTacotron2 is an end-to-end text-to-audiovisual speech synthesizer based on the Tacotron2 architecture.
The second audiovisual speech synthesis system is modular, where acoustic speech is synthesized from text using the traditional Tacotron2.
arXiv Detail & Related papers (2020-08-03T02:45:06Z)
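As noted in the CLIPSonic entry above, the text-audio correspondence is learned by conditioning a diffusion model on image embeddings from a pretrained language-image model. The sketch below illustrates that conditioning scheme with a generic DDPM-style noise-prediction objective; the denoiser architecture, dimensions, and noise schedule are placeholder assumptions, not the paper's implementation.

```python
# Minimal sketch of conditioning a diffusion model on an image embedding,
# assuming a DDPM-style noise-prediction objective over mel-spectrograms.
# The encoder embedding and denoiser below are placeholders, not CLIPSonic's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)


class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to a mel-spectrogram, given a timestep and an image condition."""

    def __init__(self, n_mels=80, d_cond=512, d_hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(d_cond, d_hidden)
        self.t_emb = nn.Embedding(T_STEPS, d_hidden)
        self.net = nn.Sequential(
            nn.Linear(n_mels + d_hidden, d_hidden), nn.SiLU(),
            nn.Linear(d_hidden, n_mels),
        )

    def forward(self, noisy_mel, t, image_emb):
        # noisy_mel: (B, frames, n_mels); t: (B,); image_emb: (B, d_cond)
        cond = self.cond_proj(image_emb) + self.t_emb(t)
        cond = cond.unsqueeze(1).expand(-1, noisy_mel.size(1), -1)
        return self.net(torch.cat([noisy_mel, cond], dim=-1))


def training_step(denoiser, mel, frame_emb):
    """One denoising step: corrupt the mel with Gaussian noise and predict that noise."""
    b = mel.size(0)
    t = torch.randint(0, T_STEPS, (b,))
    noise = torch.randn_like(mel)
    a = alphas_cumprod[t].view(b, 1, 1)
    noisy_mel = a.sqrt() * mel + (1 - a).sqrt() * noise
    pred = denoiser(noisy_mel, t, frame_emb)
    return F.mse_loss(pred, noise)


# Example with random stand-ins: a CLIP image embedding of a video frame would
# replace frame_emb during training, so generation can later be driven by a
# CLIP *text* embedding instead, without paired text-audio data.
denoiser = ConditionalDenoiser()
loss = training_step(denoiser, torch.randn(2, 100, 80), torch.randn(2, 512))
```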