Make Pixels Dance: High-Dynamic Video Generation
- URL: http://arxiv.org/abs/2311.10982v1
- Date: Sat, 18 Nov 2023 06:25:58 GMT
- Title: Make Pixels Dance: High-Dynamic Video Generation
- Authors: Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen
Zhang, Hang Li
- Abstract summary: State-of-the-art video generation methods tend to produce video clips with minimal motions despite maintaining high fidelity.
We introduce PixelDance, a novel approach that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation.
- Score: 13.944607760918997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creating high-dynamic videos such as motion-rich actions and sophisticated
visual effects poses a significant challenge in the field of artificial
intelligence. Unfortunately, current state-of-the-art video generation methods,
primarily focusing on text-to-video generation, tend to produce video clips
with minimal motions despite maintaining high fidelity. We argue that relying
solely on text instructions is insufficient and suboptimal for video
generation. In this paper, we introduce PixelDance, a novel approach based on
diffusion models that incorporates image instructions for both the first and
last frames in conjunction with text instructions for video generation.
Comprehensive experimental results demonstrate that PixelDance trained with
public data exhibits significantly better proficiency in synthesizing videos
with complex scenes and intricate motions, setting a new standard for video
generation.
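The abstract describes conditioning a diffusion model on image instructions for the first and last frames alongside a text instruction. A minimal sketch of how such per-frame image conditions might be assembled and fed to the denoiser is below; the function names, the zero-placeholder layout for unconditioned frames, and channel-wise concatenation with the noisy latents are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def build_image_conditions(first_frame, last_frame, num_frames):
    """Hypothetical sketch: place the first- and last-frame image
    instructions at the first and last positions of a per-frame
    condition stack; intermediate frames get zeros, signalling
    'no image instruction' for those positions."""
    h, w, c = first_frame.shape
    cond = np.zeros((num_frames, h, w, c), dtype=first_frame.dtype)
    cond[0] = first_frame
    cond[-1] = last_frame
    return cond

def concat_with_latents(latents, cond):
    """Join per-frame conditions with the noisy video latents along the
    channel axis, as one plausible way to inject them at each
    denoising step. latents: (T, H, W, C_lat); cond: (T, H, W, C_img)."""
    return np.concatenate([latents, cond], axis=-1)
```

Under this layout, the denoiser sees the target endpoints directly at every step, while the text instruction (handled separately, e.g. via cross-attention) describes the motion in between.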
Related papers
- Every Image Listens, Every Image Dances: Music-Driven Image Animation [8.085267959520843]
MuseDance is an end-to-end model that animates reference images using both music and text inputs.
Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences.
We present a new multimodal dataset comprising 2,904 dance videos with corresponding background music and text descriptions.
arXiv Detail & Related papers (2025-01-30T23:38:51Z)
- Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction.
By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process.
Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z)
- Fleximo: Towards Flexible Text-to-Human Motion Video Generation [17.579663311741072]
We introduce a novel task aimed at generating human motion videos solely from reference images and natural language.
We propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models.
To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions.
arXiv Detail & Related papers (2024-11-29T04:09:13Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance [36.26032505627126]
Recent advancements in text-to-video synthesis have unveiled the potential to achieve customized video generation with prompts only.
In this paper, we explore customized video generation by utilizing text as context description and motion structure.
Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model.
arXiv Detail & Related papers (2023-06-01T17:43:27Z)
- Text2Performer: Text-Driven Human Video Generation [97.3849869893433]
Text-driven content creation has evolved to be a transformative technique that revolutionizes creativity.
Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer.
In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts.
arXiv Detail & Related papers (2023-04-17T17:59:02Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)
- Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.