Related papers: CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

CPA: Camera-pose-awareness Diffusion Transformer for Video Generation

URL: http://arxiv.org/abs/2412.01429v1
Date: Mon, 02 Dec 2024 12:10:00 GMT
Title: CPA: Camera-pose-awareness Diffusion Transformer for Video Generation
Authors: Yuelei Wang, Jian Zhang, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li,
Abstract summary: CPA is a text-to-video generation approach that integrates the textual, visual, and spatial conditions.<n>Our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.
Score: 15.512186399114999
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap with controllable camera pose perspectives. Existing works such as OpenSora do NOT adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.

Related papers

UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models [54.564740558030245]
We present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism.<n>We also introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting.
arXiv Detail & Related papers (2026-02-26T12:54:46Z)
MotionFlow:Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation [30.528654507198052]
We propose a novel approach that integrates both camera and object motions by converting them into the motion of corresponding pixels.<n>Our model outperforms SOTA methods by a large margin.
arXiv Detail & Related papers (2025-09-25T13:06:12Z)
Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video.<n>Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video.<n>We evaluate MoCA's performance using the standard SSIM, image-level LPIPS, temporal LPIPS, and introduce a novel metric CASS (Conceptual Alignment Shift Score) to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames
arXiv Detail & Related papers (2025-06-01T13:28:04Z)
ATI: Any Trajectory Instruction for Controllable Video Generation [25.249489701215467]
We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion.<n>Our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models.
arXiv Detail & Related papers (2025-05-28T23:49:18Z)
LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer [10.44905923812975]
We propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation.<n>Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos.<n>Our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability.
arXiv Detail & Related papers (2025-05-20T10:18:29Z)
LightMotion: A Light and Tuning-free Method for Simulating Camera Motion in Video Generation [56.64004196498026]
LightMotion is a light and tuning-free method for simulating camera motion in video generation. operating in the latent space, it eliminates additional fine-tuning, inpainting, and depth estimation.
arXiv Detail & Related papers (2025-03-09T08:28:40Z)
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent [58.09607975296408]
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields. We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
arXiv Detail & Related papers (2025-02-05T14:26:07Z)
Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions [78.65431951506152]
We introduce a Synthetic dataset for Free-Form Motion Control (SynFMC) The proposed SynFMC dataset includes diverse objects and environments and covers various motion patterns according to specific rules. We further propose a method, Free-Form Motion Control (FMC), which enables independent or simultaneous control of object and camera movements.
arXiv Detail & Related papers (2025-01-02T18:59:45Z)
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame transformers video for 3D camera control using a ControlNet-like conditioning mechanism based on Plucker coordinates. Our work is the first to enable camera control for transformer-based video diffusion models.
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
Image Conductor: Precision Control for Interactive Video Synthesis [90.2353794019393]
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements. Image Conductor is a method for precise control of camera transitions and object movements to generate video assets from a single image.
arXiv Detail & Related papers (2024-06-21T17:55:05Z)
Motion Segmentation for Neuromorphic Aerial Surveillance [42.04157319642197]
Event cameras offer superior temporal resolution, superior dynamic range, and minimal power requirements. Unlike traditional frame-based sensors that capture redundant information at fixed intervals, event cameras asynchronously record pixel-level brightness changes. We introduce a novel motion segmentation method that leverages self-supervised vision transformers on both event data and optical flow information.
arXiv Detail & Related papers (2024-05-24T04:36:13Z)
MotionMaster: Training-free Camera Motion Transfer For Video Generation [48.706578330771386]
We propose a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos. Our model can effectively decouple camera-object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks.
arXiv Detail & Related papers (2024-04-24T10:28:54Z)
Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z)
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion [34.404342332033636]
We introduce Direct-a-Video, a system that allows users to independently specify motions for multiple objects as well as camera's pan and zoom movements. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios.
arXiv Detail & Related papers (2024-02-05T16:30:57Z)
Learning Variational Motion Prior for Video-based Motion Capture [31.79649766268877]
We present a novel variational motion prior (VMP) learning approach for video-based motion capture. Our framework can effectively reduce temporal jittering and failure modes in frame-wise pose estimation. Experiments over both public datasets and in-the-wild videos have demonstrated the efficacy and generalization capability of our framework.
arXiv Detail & Related papers (2022-10-27T02:45:48Z)
ParticleSfM: Exploiting Dense Point Trajectories for Localizing Moving Cameras in the Wild [57.37891682117178]
We present a robust dense indirect structure-from-motion method for videos that is based on dense correspondence from pairwise optical flow. A novel neural network architecture is proposed for processing irregular point trajectory data. Experiments on MPI Sintel dataset show that our system produces significantly more accurate camera trajectories.
arXiv Detail & Related papers (2022-07-19T09:19:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.