Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic
Talking-head Generation
- URL: http://arxiv.org/abs/2308.06457v1
- Date: Sat, 12 Aug 2023 03:30:49 GMT
- Title: Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic
Talking-head Generation
- Authors: Zhichao Wang, Mengyu Dai, Keld Lundgaard
- Abstract summary: We propose a novel two-stage framework for person-agnostic video cloning.
In the first stage, we leverage pretrained zero-shot models to achieve text-to-speech conversion.
In the second stage, an audio-driven talking head generation method is employed to produce compelling videos.
- Score: 16.12424393291571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of ChatGPT has introduced innovative methods for information
gathering and analysis. However, the information provided by ChatGPT is limited
to text, and the visualization of this information remains constrained.
Previous research has explored zero-shot text-to-video (TTV) approaches to
transform text into videos. However, these methods lacked control over the
identity of the generated audio, i.e., they were not identity-agnostic, which
hindered their effectiveness. To address this limitation, we propose a novel two-stage
framework for person-agnostic video cloning, specifically focusing on TTV
generation. In the first stage, we leverage pretrained zero-shot models to
achieve text-to-speech (TTS) conversion. In the second stage, an audio-driven
talking head generation method is employed to produce compelling videos from
the audio generated in the first stage. This paper presents a
comparative analysis of different TTS and audio-driven talking head generation
methods, identifying the most promising approach for future research and
development. Audio and video samples can be found at the following link:
https://github.com/ZhichaoWang970201/Text-to-Video/tree/main.
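To make the two-stage pipeline above concrete, here is a minimal sketch. It assumes Coqui TTS with the zero-shot YourTTS model for stage one and a SadTalker-style inference script for stage two; these specific tools, paths, and flags are illustrative stand-ins, not the authors' released implementation.

```python
import subprocess
from TTS.api import TTS  # Coqui TTS: pip install TTS

def text_to_talking_head(text: str, voice_wav: str, face_image: str) -> None:
    # Stage 1: zero-shot, identity-agnostic TTS. The reference speaker is
    # supplied at inference time, so no per-speaker training is required.
    tts = TTS("tts_models/multilingual/multi-dataset/your_tts")
    tts.tts_to_file(text=text, speaker_wav=voice_wav,
                    language="en", file_path="speech.wav")

    # Stage 2: audio-driven talking-head generation. The flags follow
    # SadTalker's public inference script (an assumption for illustration).
    subprocess.run(
        ["python", "inference.py",
         "--driven_audio", "speech.wav",
         "--source_image", face_image,
         "--result_dir", "results/"],
        check=True,
    )

text_to_talking_head("Hello, this is a cloned voice.", "reference.wav", "face.png")
```

Because the reference voice and face image are supplied only at inference time, the same pipeline serves any identity without retraining, which is the sense in which it is identity-agnostic.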
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce T2AV-Bench, a groundbreaking benchmark for Text-to-Audio generation aligned with videos.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, complemented by our Audio-Visual ControlNet.
arXiv Detail & Related papers (2024-03-08T22:27:38Z)
- TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z)
- Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment [16.304894187743013]
TEFAL is a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query.
Our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately; a minimal sketch of this mechanism follows the entry.
arXiv Detail & Related papers (2023-07-24T17:43:13Z)
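As referenced above, a hedged PyTorch sketch of TEFAL's two independent cross-modal attention blocks, with the text features acting as the query for each modality. The dimensions, pooling, and names are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TextConditionedAlignment(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Two independent blocks: text->audio and text->video attention.
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, audio, video):
        # text: (B, Lt, D); audio: (B, La, D); video: (B, Lv, D)
        audio_repr, _ = self.text_to_audio(query=text, key=audio, value=audio)
        video_repr, _ = self.text_to_video(query=text, key=video, value=video)
        # Pool over text positions to get one embedding per modality.
        return audio_repr.mean(dim=1), video_repr.mean(dim=1)

model = TextConditionedAlignment()
text = torch.randn(2, 16, 512)   # token features of the text query
audio = torch.randn(2, 40, 512)  # audio frame features
video = torch.randn(2, 32, 512)  # video frame features
a, v = model(text, audio, video)  # text-conditioned representations
```

Keeping the two blocks independent lets each modality be aligned to the text query separately, without forcing a single shared joint space.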
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model; a toy sketch of the conditioning pathway follows the entry.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
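As referenced above, a toy sketch of CLIPSonic-style conditioning: a CLIP image embedding of a video frame conditions a mel-spectrogram denoiser. The denoiser below is a deliberately tiny stand-in (timestep embedding omitted for brevity); only the CLIP-to-audio conditioning pathway mirrors the paper's idea.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class ConditionedDenoiser(nn.Module):
    """Predicts the noise in a mel-spectrogram, given a CLIP embedding."""
    def __init__(self, mel_dim: int = 80, cond_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.net = nn.Sequential(nn.Linear(mel_dim + hidden, hidden),
                                 nn.SiLU(), nn.Linear(hidden, mel_dim))

    def forward(self, noisy_mel, clip_emb):
        # noisy_mel: (B, T, mel_dim); clip_emb: (B, cond_dim)
        cond = self.cond_proj(clip_emb).unsqueeze(1).expand(-1, noisy_mel.size(1), -1)
        return self.net(torch.cat([noisy_mel, cond], dim=-1))

frame = Image.new("RGB", (224, 224))  # stand-in for a real video frame
inputs = processor(images=frame, return_tensors="pt")
with torch.no_grad():
    emb = clip.get_image_features(**inputs)  # (1, 512) image embedding

denoiser = ConditionedDenoiser()
eps_hat = denoiser(torch.randn(1, 100, 80), emb)  # one denoising step
```

At inference time, text can replace the frame on the conditioning side because CLIP embeds images and text in a shared space, which is how the visual modality serves as a bridge.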
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which designs a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary [10.590649169151055]
We present a novel approach to synthesize video from text.
The method builds a phoneme-pose dictionary (sketched after this entry) and trains a generative adversarial network (GAN) to generate video.
Compared to audio-driven video generation algorithms, our approach has a number of advantages.
arXiv Detail & Related papers (2021-04-29T19:54:41Z)
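As a rough illustration of the phoneme-pose dictionary idea above, the toy sketch below maps phonemes to mouth-pose keyframes and interpolates between them to produce the pose track a GAN would then render; the phoneme set and landmark values are invented for illustration.

```python
import numpy as np

phoneme_pose = {            # phoneme -> 2D mouth landmarks (toy values)
    "AA": np.array([[0.0, 1.0], [0.0, -1.0]]),   # open mouth
    "M":  np.array([[0.0, 0.1], [0.0, -0.1]]),   # closed mouth
    "IY": np.array([[0.0, 0.4], [0.0, -0.4]]),   # spread lips
}

def pose_sequence(phonemes, frames_per_phoneme=5):
    """Linearly interpolate poses between consecutive phoneme keyframes."""
    keys = [phoneme_pose[p] for p in phonemes]
    out = []
    for a, b in zip(keys, keys[1:]):
        for t in np.linspace(0.0, 1.0, frames_per_phoneme, endpoint=False):
            out.append((1 - t) * a + t * b)
    out.append(keys[-1])
    return np.stack(out)  # (n_frames, n_landmarks, 2) pose track for the GAN

poses = pose_sequence(["M", "AA", "IY"])
print(poses.shape)  # (11, 2, 2)
```

Driving the generator from phoneme-derived poses rather than raw audio is what gives the text-driven approach its claimed advantages over audio-driven pipelines.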