Fleximo: Towards Flexible Text-to-Human Motion Video Generation
- URL: http://arxiv.org/abs/2411.19459v1
- Date: Fri, 29 Nov 2024 04:09:13 GMT
- Title: Fleximo: Towards Flexible Text-to-Human Motion Video Generation
- Authors: Yuhang Zhang, Yuan Zhou, Zeyu Liu, Yuxuan Cai, Qiuyue Wang, Aidong Men, Huan Yang,
- Abstract summary: We introduce a novel task aimed at generating human motion videos solely from reference images and natural language.
We propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models.
To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions.
- Score: 17.579663311741072
- License:
- Abstract: Current methods for generating human motion videos rely on extracting pose sequences from reference videos, which restricts flexibility and control. Additionally, due to the limitations of pose detection techniques, the extracted pose sequences can sometimes be inaccurate, leading to low-quality video outputs. We introduce a novel task aimed at generating human motion videos solely from reference images and natural language. This approach offers greater flexibility and ease of use, as text is more accessible than the desired guidance videos. However, training an end-to-end model for this task requires millions of high-quality text and human motion video pairs, which are challenging to obtain. To address this, we propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models. This approach is not straightforward, as the text-generated skeletons may not consistently match the scale of the reference image and may lack detailed information. To overcome these challenges, we introduce an anchor point based rescale method and design a skeleton adapter to fill in missing details and bridge the gap between text-to-motion and motion-to-video generation. We also propose a video refinement process to further enhance video quality. A large language model (LLM) is employed to decompose natural language into discrete motion sequences, enabling the generation of motion videos of any desired length. To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions. We also propose a new metric, MotionScore, to evaluate the accuracy of motion following. Both qualitative and quantitative results demonstrate that our method outperforms existing text-conditioned image-to-video generation methods. All code and model weights will be made publicly available.
Related papers
- Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.
We translate high-level user requests into detailed, semi-dense motion prompts.
We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models [59.10171699717122]
MoTrans is a customized motion transfer method enabling video generation of similar motion in new context.
multimodal representations from recaptioned prompt and video frames promote the modeling of appearance.
Our method effectively learns specific motion pattern from singular or multiple reference videos.
arXiv Detail & Related papers (2024-12-02T10:07:59Z) - Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - VMC: Video Motion Customization using Temporal Attention Adaption for
Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables text-to-image diffusion model Learn A specific Motion Pattern with 816 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Show Me What and Tell Me How: Video Synthesis via Multimodal
Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z) - Render In-between: Motion Guided Video Synthesis for Action
Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to inference the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.