LAMP: Language-Assisted Motion Planning for Controllable Video Generation
- URL: http://arxiv.org/abs/2512.03619v2
- Date: Mon, 08 Dec 2025 11:47:40 GMT
- Title: LAMP: Language-Assisted Motion Planning for Controllable Video Generation
- Authors: Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Erkut Erdem, Aykut Erdem, Duygu Ceylan,
- Abstract summary: We introduce LAMP that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and cameras.<n>LLMs generate structured motion programs from natural language, which are deterministically mapped to 3D trajectories.<n>Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives.
- Score: 46.55844620442438
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL), inspired by cinematography conventions. By harnessing program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate LAMP's improved performance in motion controllability and alignment with user intent compared to state-of-the-art alternatives establishing the first framework for generating both object and camera motions directly from natural language specifications. Code, models and data are available on our project page.
Related papers
- ATI: Any Trajectory Instruction for Controllable Video Generation [25.249489701215467]
We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion.<n>Our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models.
arXiv Detail & Related papers (2025-05-28T23:49:18Z) - CineMaster: A 3D-Aware and Controllable Framework for Cinematic Text-to-Video Generation [76.72787726497343]
We present CineMaster, a framework for 3D-aware and controllable text-to-video generation.<n>Our goal is to empower users with comparable controllability as professional film directors.
arXiv Detail & Related papers (2025-02-12T18:55:36Z) - MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent [55.15697390165972]
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation.<n>The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields.<n>We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
arXiv Detail & Related papers (2025-02-05T14:26:07Z) - Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.<n>We translate high-level user requests into detailed, semi-dense motion prompts.<n>We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z) - LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models [57.92316645992816]
Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space.<n>We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs)<n>We demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.
arXiv Detail & Related papers (2024-12-03T06:15:04Z) - MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLMLM)
It supports multimodal control conditions through pre-trained Large Language Models (LLMs)
It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z) - LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning [19.801187860991117]
This work introduces LaMP, a novel Language-Motion Pretraining model.<n>LaMP generates motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences.<n>For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model.
arXiv Detail & Related papers (2024-10-09T17:33:03Z) - DartControl: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control [12.465927271402442]
Text-conditioned human motion generation allows for user interaction through natural language.<n>DartControl is a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control.<n>Our model effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs.
arXiv Detail & Related papers (2024-10-07T17:58:22Z) - InteractPro: A Unified Framework for Motion-Aware Image Composition [54.407337049352556]
We introduce InteractPro, a comprehensive framework for dynamic motion-aware image composition.<n>At its core is InteractPlan, an intelligent planner that leverages a Large Vision Language Model (LVLM) for scenario analysis and object placement.<n>Based on each scenario, InteractPlan selects between our two specialized modules: InteractPhys and InteractMotion.
arXiv Detail & Related papers (2024-09-16T08:44:17Z) - Image Conductor: Precision Control for Interactive Video Synthesis [90.2353794019393]
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements.
Image Conductor is a method for precise control of camera transitions and object movements to generate video assets from a single image.
arXiv Detail & Related papers (2024-06-21T17:55:05Z) - MotionScript: Natural Language Descriptions for Expressive 3D Human Motions [6.710007544943157]
We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions.<n>MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement.<n> MotionScript serves as both a descriptive tool and a training resource for text-to-motion models.
arXiv Detail & Related papers (2023-12-19T22:33:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.