Text2Performer: Text-Driven Human Video Generation
- URL: http://arxiv.org/abs/2304.08483v1
- Date: Mon, 17 Apr 2023 17:59:02 GMT
- Title: Text2Performer: Text-Driven Human Video Generation
- Authors: Yuming Jiang, Shuai Yang, Tong Liang Koh, Wayne Wu, Chen Change Loy,
Ziwei Liu
- Abstract summary: Text-driven content creation has evolved to be a transformative technique that revolutionizes creativity.
Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer.
In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts.
- Score: 97.3849869893433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-driven content creation has evolved to be a transformative technique
that revolutionizes creativity. Here we study the task of text-driven human
video generation, where a video sequence is synthesized from texts describing
the appearance and motions of a target performer. Compared to general
text-driven video generation, human-centric video generation requires
maintaining the appearance of the synthesized human while it performs complex
motions. In this work, we present Text2Performer to generate vivid human videos
with articulated motions from texts. Text2Performer has two novel designs: 1)
decomposed human representation and 2) diffusion-based motion sampler. First,
we decompose the VQVAE latent space into human appearance and pose
representation in an unsupervised manner by utilizing the nature of human
videos. In this way, the appearance is well maintained along the generated
frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose
embeddings. Unlike existing VQ-based methods that operate in the discrete
space, the continuous VQ-diffuser directly outputs continuous pose embeddings
for better motion modeling. Finally, a motion-aware masking strategy masks the
pose embeddings spatio-temporally to enhance temporal coherence. Moreover, to
facilitate the task of text-driven human video
generation, we contribute a Fashion-Text2Video dataset with manually annotated
action labels and text descriptions. Extensive experiments demonstrate that
Text2Performer generates high-quality human videos (up to 512x256 resolution)
with diverse appearances and flexible motions.
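
As a rough illustration of the masking idea described above, the sketch below shows one way a motion-aware, spatio-temporal mask over per-frame pose embeddings could be implemented while a single appearance code is shared across frames. This is an assumption-laden sketch in PyTorch, not the authors' released implementation; the tensor shapes, mask ratio, and the `motion_aware_mask` helper are hypothetical.

```python
# Illustrative sketch only (not the authors' code): mask pose embeddings
# jointly in space and time while one appearance code stays fixed per clip.
# Shapes, mask ratio, and helper name are assumptions for the example.
import torch

def motion_aware_mask(pose_emb: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """pose_emb: (T, N, D) = frames x spatial pose tokens x embedding dim."""
    T, N, _ = pose_emb.shape
    mask = torch.zeros(T, N, dtype=torch.bool)

    # Mask a contiguous temporal span so the sampler must in-paint motion over time.
    span = max(1, int(T * mask_ratio))
    start = int(torch.randint(0, T - span + 1, (1,)))
    mask[start:start + span] = True

    # Additionally mask a random subset of spatial tokens in every frame.
    mask |= (torch.rand(N) < mask_ratio).unsqueeze(0)

    masked = pose_emb.clone()
    masked[mask] = 0.0  # placeholder value for the masked (to-be-denoised) embeddings
    return masked

# Usage: the appearance embedding is kept fixed; only pose embeddings are masked.
appearance = torch.randn(1, 256)    # one appearance code for the whole clip
poses = torch.randn(16, 32, 256)    # 16 frames, 32 pose tokens per frame
poses_masked = motion_aware_mask(poses)
```

Under this reading, masking whole temporal spans (rather than independent tokens) is what pushes the sampler toward temporally coherent motion.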
Related papers
- Text-Animator: Controllable Visual Text Video Generation [149.940821790235]
We propose an innovative approach termed Text-Animator for visual text video generation.
Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos.
We also develop a camera control module and a text refinement module to improve the stability of generated visual text.
arXiv Detail & Related papers (2024-06-25T17:59:41Z)
- Towards 4D Human Video Stylization [56.33756124829298]
We present a first step towards 4D (3D and time) human video stylization, which addresses style transfer, novel view synthesis and human animation.
We leverage Neural Radiance Fields (NeRFs) to represent videos, conducting stylization in the rendered feature space.
Our framework uniquely extends its capabilities to accommodate novel poses and viewpoints, making it a versatile tool for creative human video stylization.
arXiv Detail & Related papers (2023-12-07T08:58:33Z)
- Make Pixels Dance: High-Dynamic Video Generation [13.944607760918997]
State-of-the-art video generation methods tend to produce video clips with minimal motions despite maintaining high fidelity.
We introduce PixelDance, a novel approach that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation.
arXiv Detail & Related papers (2023-11-18T06:25:58Z)
- Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model [11.873294782380984]
We propose a fine-grained method for generating high-quality, conditional human motion sequences that supports precise text descriptions.
Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to achieve multi-step inference.
arXiv Detail & Related papers (2023-09-12T14:43:47Z)
- Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model [57.855362366674264]
We propose Dancing Avatar, designed to fabricate human motion videos driven by poses and textual cues.
Our approach employs a pretrained T2I diffusion model to generate each video frame in an autoregressive fashion.
arXiv Detail & Related papers (2023-08-15T13:00:42Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion token, a discrete and compact motion representation.
Our approach is flexible and can be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z)
- TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.