Unified Camera Positional Encoding for Controlled Video Generation
- URL: http://arxiv.org/abs/2512.07237v1
- Date: Mon, 08 Dec 2025 07:34:01 GMT
- Title: Unified Camera Positional Encoding for Controlled Video Generation
- Authors: Cheng Zhang, Boying Li, Meng Wei, Yan-Pei Cao, Camilo Cruz Gambardella, Dinh Phung, Jianfei Cai
- Abstract summary: Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types.
- Score: 48.5789182990001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have emerged as a universal backbone across 3D perception, video generation, and world models for autonomous driving and embodied AI, where understanding camera geometry is essential for grounding visual observations in three-dimensional space. However, existing camera encoding methods often rely on simplified pinhole assumptions, restricting generalization across the diverse intrinsics and lens distortions in real-world cameras. We introduce Relative Ray Encoding, a geometry-consistent representation that unifies complete camera information, including 6-DoF poses, intrinsics, and lens distortions. To evaluate its capability under diverse controllability demands, we adopt camera-controlled text-to-video generation as a testbed task. Within this setting, we further identify pitch and roll as two components effective for Absolute Orientation Encoding, enabling full control over the initial camera orientation. Together, these designs form UCPE (Unified Camera Positional Encoding), which integrates into a pretrained video Diffusion Transformer through a lightweight spatial attention adapter, adding less than 1% trainable parameters while achieving state-of-the-art camera controllability and visual fidelity. To facilitate systematic training and evaluation, we construct a large video dataset covering a wide range of camera motions and lens types. Extensive experiments validate the effectiveness of UCPE in camera-controllable video generation and highlight its potential as a general camera representation for Transformers across future multi-view, video, and 3D tasks. Code will be available at https://github.com/chengzhag/UCPE.
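As a rough illustration of the kind of camera representation the abstract describes, the sketch below computes a per-pixel ray map from pinhole intrinsics and a 6-DoF pose, then re-expresses one view's rays relative to a reference camera as a Plücker-style 6-D embedding. This is a hypothetical pinhole-only sketch under common multi-view conventions, not the paper's implementation: UCPE's Relative Ray Encoding additionally models lens distortions, and the function names here are illustrative.

```python
import numpy as np

def ray_map(K, R, t, H, W):
    """Per-pixel ray directions (world frame) and camera center for a pinhole camera.
    K: 3x3 intrinsics; R, t: world-to-camera pose, i.e. x_cam = R @ x_world + t.
    """
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Back-project pixel centers to camera-frame directions.
    pix = np.stack([xs + 0.5, ys + 0.5, np.ones_like(xs, dtype=float)], axis=-1)
    dirs_cam = pix @ np.linalg.inv(K).T                 # (H, W, 3)
    dirs_world = dirs_cam @ R                           # row-wise R^T @ d
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origin = -R.T @ t                                   # camera center in world frame
    return dirs_world, origin

def relative_rays(dirs_world, origin, R_ref, t_ref):
    """Express a view's rays in a reference camera's frame, removing the global gauge."""
    d = dirs_world @ R_ref.T              # rotate directions into the reference frame
    o = R_ref @ origin + t_ref            # camera center in the reference frame
    m = np.cross(np.broadcast_to(o, d.shape), d)  # Plücker moment o x d
    return np.concatenate([d, m], axis=-1)        # 6-D ray embedding per pixel
```

Because the embedding depends only on geometry relative to the reference view, a global rigid transform of all cameras leaves it unchanged, which is the usual motivation for relative (rather than absolute) ray conditioning.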
Related papers
- Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation [21.084121261693365]
We propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from an explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories.
arXiv Detail & Related papers (2026-01-15T09:26:45Z) - CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization [32.42754288735215]
CETCAM is a camera-controllable video generation framework. It eliminates the need for camera annotations through a consistent and extensible tokenization scheme. It learns robust camera controllability from diverse raw video data and refines fine-grained visual quality using high-fidelity datasets.
arXiv Detail & Related papers (2025-12-22T04:21:39Z) - Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation [49.12018869332346]
InfCam is a camera-controlled video-to-video generation framework with high pose fidelity. Among its key components is infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model.
arXiv Detail & Related papers (2025-12-18T20:03:05Z) - ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding [60.574308105414026]
ReDirector is a camera-controlled video generation method for dynamically captured variable-length videos. We introduce Rotary Camera Encoding (RoCE), which integrates camera conditioning into RoPE for retake generation. Our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation.
arXiv Detail & Related papers (2025-11-25T01:38:56Z) - Cameras as Relative Positional Encoding [37.675563572777136]
Multi-view transformers must use camera geometry to ground visual tokens in 3D space. We show how relative camera conditioning improves performance in feedforward novel view synthesis. We then verify that these benefits persist across other tasks, such as stereo depth estimation and discriminative recognition, as well as larger model sizes.
arXiv Detail & Related papers (2025-07-14T17:22:45Z) - Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation [73.73984727616198]
We present Uni3C, a unified framework for precise control of both camera and human motion in video generation. First, we propose PCDController, a plug-and-play control module trained with a frozen video generative backbone. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters.
arXiv Detail & Related papers (2025-04-21T07:10:41Z) - AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers [66.29824750770389]
We analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture.
arXiv Detail & Related papers (2024-11-27T18:49:13Z) - VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We show how to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism. Our work is the first to enable camera control for transformer-based video diffusion models.
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.