DirectorLLM for Human-Centric Video Generation
- URL: http://arxiv.org/abs/2412.14484v1
- Date: Thu, 19 Dec 2024 03:10:26 GMT
- Title: DirectorLLM for Human-Centric Video Generation
- Authors: Kunpeng Song, Tingbo Hou, Zecheng He, Haoyu Ma, Jialiang Wang, Animesh Sinha, Sam Tsai, Yaqiao Luo, Xiaoliang Dai, Li Chen, Xide Xia, Peizhao Zhang, Peter Vajda, Ahmed Elgammal, Felix Juefei-Xu
- Abstract summary: We introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos.
Our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.
- Score: 46.37441947526771
- Abstract: In this paper, we introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. As foundational text-to-video models rapidly evolve, the demand for high-quality human motion and interaction grows. To address this need and enhance the authenticity of human motions, we extend the LLM from a text generator to a video director and human motion simulator. Utilizing open-source resources from Llama 3, we train the DirectorLLM to generate detailed instructional signals, such as human poses, to guide video generation. This approach offloads the simulation of human motion from the video generator to the LLM, effectively creating informative outlines for human-centric scenes. These signals are used as conditions by the video renderer, facilitating more realistic and prompt-following video generation. As an independent LLM module, it can be applied to different video renderers, including UNet and DiT, with minimal effort. Experiments on automatic evaluation benchmarks and human evaluations show that our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.
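The abstract describes a two-stage division of labor: an LLM "director" plans per-frame human poses from the text prompt, and a separate video renderer (UNet or DiT) consumes those poses as conditioning. The sketch below illustrates that interface only; the class and method names (`DirectorLLM.plan_poses`, `VideoRenderer.render`) and the placeholder pose data are invented for illustration, not the paper's actual API.

```python
# Minimal sketch of the director/renderer split, under assumed interfaces.
from dataclasses import dataclass

@dataclass
class PoseFrame:
    frame_idx: int
    keypoints: list  # e.g. (x, y) joint positions in a canonical skeleton

class DirectorLLM:
    """Stands in for a Llama-3-based model fine-tuned to map a text
    prompt to an instructional pose sequence (hypothetical interface)."""
    def plan_poses(self, prompt: str, num_frames: int) -> list:
        # A real model would decode pose tokens; here we emit a trivial
        # static placeholder pose for each frame.
        neutral = [(0.5, 0.1), (0.5, 0.4), (0.4, 0.7), (0.6, 0.7)]
        return [PoseFrame(i, list(neutral)) for i in range(num_frames)]

class VideoRenderer:
    """Stands in for a UNet- or DiT-based video model that accepts pose
    conditioning; because the director is a separate module, either
    backbone can be swapped in."""
    def render(self, prompt: str, poses: list) -> list:
        return [f"frame {p.frame_idx}: {len(p.keypoints)} conditioning joints"
                for p in poses]

director = DirectorLLM()
renderer = VideoRenderer()
poses = director.plan_poses("a person waves hello", num_frames=8)
frames = renderer.render("a person waves hello", poses)
```

The key design point this mirrors is that motion simulation lives entirely in the director module, so the renderer only needs to follow conditioning signals rather than model human dynamics itself.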
Related papers
- OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models [25.45077656291886]
We propose a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase.
These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation.
Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs.
arXiv Detail & Related papers (2025-02-03T05:17:32Z) - Move-in-2D: 2D-Conditioned Human Motion Generation [54.067588636155115]
We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image.
Our approach accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene.
arXiv Detail & Related papers (2024-12-17T18:58:07Z) - Towards motion from video diffusion models [10.493424298717864]
We propose to synthesize human motion by deforming an SMPL-X body representation guided by Score distillation sampling (SDS) calculated using a video diffusion model.
By analyzing the fidelity of the resulting animations, we gain insights into the extent to which we can obtain motion using publicly available text-to-video diffusion models.
arXiv Detail & Related papers (2024-11-19T19:35:28Z) - HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation [64.37874983401221]
We present HumanVid, the first large-scale high-quality dataset tailored for human image animation.
For the real-world data, we compile a vast collection of real-world videos from the internet.
For the synthetic data, we collected 10K 3D avatar assets and leveraged existing assets of body shapes, skin textures, and clothing.
arXiv Detail & Related papers (2024-07-24T17:15:58Z) - VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z) - MotionLLM: Understanding Human Behaviors from Human Motions and Videos [40.132643319573205]
This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding.
We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
arXiv Detail & Related papers (2024-05-30T17:59:50Z) - LLM-grounded Video Diffusion Models [57.23066793349706]
Video diffusion models have emerged as a promising tool for neural video generation.
Current models often struggle with complex prompts, producing restricted or incorrect motion.
We introduce LLM-grounded Video Diffusion (LVD).
Our results demonstrate that LVD significantly outperforms its base video diffusion model.
arXiv Detail & Related papers (2023-09-29T17:54:46Z) - MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators [108.67006263044772]
This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals.
We first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction.
Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters.
arXiv Detail & Related papers (2023-06-19T12:58:17Z)
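The MotionGPT summary above describes a two-step recipe: quantize continuous multimodal control signals into discrete codes, then formulate those codes inside a unified prompt instruction for the LLM. A toy illustration of that idea follows; the 1-D codebook, the `<motion_k>` token format, and the prompt template are all assumptions made for the sketch, not MotionGPT's actual tokenization.

```python
# Toy quantize-then-prompt pipeline, with an invented codebook and template.

def quantize(signal, codebook):
    """Map each continuous value to the index of its nearest codebook entry."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - v))
            for v in signal]

def build_prompt(task: str, codes) -> str:
    """Formulate the discrete codes as tokens inside one instruction prompt."""
    code_tokens = " ".join(f"<motion_{c}>" for c in codes)
    return f"Instruction: {task}\nControl: {code_tokens}\nOutput:"

codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]
signal = [0.9, 0.1, -0.4, -1.2]      # e.g. a 1-D slice of a pose trajectory
codes = quantize(signal, codebook)   # nearest-neighbor indices -> [4, 2, 1, 0]
prompt = build_prompt("generate a walking motion", codes)
```

Once control signals are expressed as prompt tokens this way, motion generation reduces to ordinary conditional text generation, which is why only a small fraction of LLM parameters (0.4% in the paper) needs tuning.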
This list is automatically generated from the titles and abstracts of the papers in this site.