Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM
- URL: http://arxiv.org/abs/2504.12048v1
- Date: Wed, 16 Apr 2025 13:04:01 GMT
- Title: Modular-Cam: Modular Dynamic Camera-view Video Generation with LLM
- Authors: Zirui Pan, Xin Wang, Yipeng Zhang, Hong Chen, Kwan Man Cheng, Yaofei Wu, Wenwu Zhu
- Abstract summary: We propose a novel text-to-video generation method, i.e., Modular-Cam. To better understand a given complex prompt, we utilize a large language model to analyze user instructions. To generate a video containing dynamic scenes that match the given camera-views, we incorporate the widely-used temporal transformer.
- Score: 43.889033468684445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Video generation, which uses a provided text prompt to generate high-quality videos, has drawn increasing attention and achieved great success owing to the recent development of diffusion models. Existing methods mainly rely on a pre-trained text encoder to capture semantic information and perform cross-attention with the encoded text prompt to guide video generation. However, for complex prompts that contain dynamic scenes and multiple camera-view transformations, these methods can neither decompose the overall information into separate scenes nor smoothly change scenes according to the corresponding camera-views. To solve these problems, we propose a novel method, i.e., Modular-Cam. Specifically, to better understand a given complex prompt, we utilize a large language model to analyze user instructions and decouple them into multiple scenes together with transition actions. To generate a video containing dynamic scenes that match the given camera-views, we incorporate the widely-used temporal transformer into the diffusion model to ensure continuity within a single scene, and we propose CamOperator, a modular network-based module that precisely controls camera movements. Moreover, we propose AdaControlNet, which utilizes ControlNet to ensure consistency across scenes and adaptively adjusts the color tone of the generated video. Extensive qualitative and quantitative experiments demonstrate Modular-Cam's strong capability to generate multi-scene videos and its fine-grained control of camera movements. Generated results are available at https://modular-cam.github.io.
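The abstract outlines a pipeline with four parts: LLM-based decomposition of the prompt into scenes and transition actions, a temporal transformer inside the diffusion backbone, per-transition CamOperator modules, and AdaControlNet for cross-scene consistency and color-tone adaptation. The minimal Python sketch below illustrates how those pieces could fit together; every class, function, and parameter name here (SceneSpec, decompose_prompt, generate_video, and the callables passed in) is a hypothetical stand-in inferred from the abstract, not the authors' released code or API.

```python
# Hypothetical sketch of the Modular-Cam flow described in the abstract:
# an LLM splits a complex prompt into scenes plus camera transitions, each
# scene is generated with the camera-control module matching its transition,
# and a ControlNet-style component keeps adjacent scenes consistent.
from dataclasses import dataclass
from typing import List

@dataclass
class SceneSpec:
    description: str    # text describing one scene
    camera_action: str  # transition into this scene, e.g. "pan_left", "zoom_in"

def decompose_prompt(llm, prompt: str) -> List[SceneSpec]:
    """Ask the LLM to decouple a complex prompt into scenes and transition actions."""
    # In practice this would be a structured instruction to an LLM; here we only
    # illustrate the expected output shape (a list of scene/transition records).
    reply = llm(f"Split into scenes with camera transitions: {prompt}")
    return [SceneSpec(**item) for item in reply]

def generate_video(prompt: str, llm, base_diffusion, cam_operators, ada_controlnet):
    """Generate a multi-scene video by chaining per-scene diffusion calls."""
    scenes = decompose_prompt(llm, prompt)
    clips, prev_last_frame = [], None
    for scene in scenes:
        # Pick the modular camera-control network matching the transition action.
        cam_op = cam_operators[scene.camera_action]
        # Condition on the previous scene's last frame so the ControlNet-style
        # branch can enforce cross-scene consistency and adapt the color tone.
        control = ada_controlnet(prev_last_frame) if prev_last_frame is not None else None
        clip = base_diffusion(scene.description, camera_module=cam_op, control=control)
        clips.append(clip)
        prev_last_frame = clip[-1]
    # Concatenate the scene clips into one continuous video (list of frames).
    return [frame for clip in clips for frame in clip]
```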
Related papers
- OmniCam: Unified Multimodal Video Generation via Camera Control [42.94206239207397]
Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention.
Existing methods face challenges such as complex interaction and limited control capabilities.
We present OmniCam, a unified camera framework that generates spatio-temporally consistent videos.
arXiv Detail & Related papers (2025-04-03T06:38:30Z)
- ReCamMaster: Camera-Controlled Generative Rendering from A Single Video [72.42376733537925]
ReCamMaster is a camera-controlled generative video re-rendering framework.
It reproduces the dynamic scene of an input video at novel camera trajectories.
Our method also finds promising applications in video stabilization, super-resolution, and outpainting.
arXiv Detail & Related papers (2025-03-14T17:59:31Z)
- BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations [82.94002870060045]
Existing video generation models struggle to follow complex text prompts and synthesize multiple objects.
We develop a blob-grounded video diffusion model named BlobGEN-Vid that allows users to control object motions and fine-grained object appearance.
We show that our framework is model-agnostic and build BlobGEN-Vid based on both U-Net and DiT-based video diffusion models.
arXiv Detail & Related papers (2025-01-13T19:17:06Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We show how to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism.
Our work is the first to enable camera control for transformer-based video diffusion models.
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- Training-free Camera Control for Video Generation [15.79168688275606]
We propose a training-free and robust solution that offers camera movement control for off-the-shelf video diffusion models.
Our method does not require any supervised finetuning on camera-annotated datasets or self-supervised training via data augmentation.
It is plug-and-play with most pretrained video diffusion models and generates camera-controllable videos with a single image or text prompt as input.
arXiv Detail & Related papers (2024-06-14T15:33:00Z)
- Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control [70.17137528953953]
Collaborative video diffusion (CVD) is trained on top of a state-of-the-art camera-control module for video generation.
CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines.
arXiv Detail & Related papers (2024-05-27T17:58:01Z)
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation [86.36135895375425]
Controllability plays a crucial role in video generation, as it allows users to create and edit content more precisely.
Existing models, however, lack control of camera pose, which serves as a cinematic language to express deeper narrative nuances.
We introduce CameraCtrl, enabling accurate camera pose control for video diffusion models.
arXiv Detail & Related papers (2024-04-02T16:52:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.