BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
- URL: http://arxiv.org/abs/2512.05076v1
- Date: Thu, 04 Dec 2025 18:40:52 GMT
- Title: BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
- Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
- Abstract summary: We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose. We show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories.
- Score: 48.835425748367875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: https://19reborn.github.io/Bullet4D/
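To make the two injection pathways described in the abstract concrete, below is a minimal PyTorch sketch of an attention block conditioned on per-token world time and camera pose. This is an illustrative reconstruction from the abstract only; the module names, tensor shapes, and embedding choices (sinusoidal time embedding, flattened 3x4 extrinsics) are assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

def sinusoidal_embed(x, dim):
    """Standard sinusoidal embedding of a scalar per token (dim must be even)."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / half))
    ang = x[..., None] * freqs                          # (..., half)
    return torch.cat([ang.sin(), ang.cos()], dim=-1)    # (..., dim)

class FourDConditionedBlock(nn.Module):
    """One attention block conditioned on world time and camera pose.

    Two pathways, mirroring the abstract:
      1. a 4D positional encoding (world time + camera pose embedding)
         added to queries/keys inside attention;
      2. adaptive LayerNorm whose scale/shift are regressed from the same
         conditioning, modulating the features.
    """
    def __init__(self, dim, pose_dim=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.pose_proj = nn.Linear(pose_dim, dim)   # flattened 3x4 extrinsics
        self.ada = nn.Linear(dim, 2 * dim)          # -> (scale, shift)

    def forward(self, x, world_time, cam_pose):
        # x: (B, N, dim) video tokens; world_time: (B, N) continuous times
        # cam_pose: (B, N, 12) per-token flattened camera extrinsics
        cond = sinusoidal_embed(world_time, x.shape[-1]) + self.pose_proj(cam_pose)
        scale, shift = self.ada(cond).chunk(2, dim=-1)
        h = self.norm(x) * (1 + scale) + shift      # adaptive normalization
        qk = h + cond                               # 4D positional encoding
        out, _ = self.attn(qk, qk, h)
        return x + out
```

Note how decoupling falls out of the interface: the same frame index can carry any (world_time, cam_pose) pair, so world time can be frozen while the camera moves, or vice versa.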
Related papers
- VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control [83.92729346325163]
VerseCrafter is a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos (a toy version of this rendering step is sketched below).
arXiv Detail & Related papers (2026-01-08T17:28:52Z)
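As a toy version of the "render a static background point cloud into conditioning signals" step described above, here is a minimal NumPy sketch. The naive splatting, z-buffering, and all names are illustrative assumptions, not VerseCrafter's pipeline.

```python
import numpy as np

def render_point_condition(points, colors, K, R, t, hw=(256, 256)):
    """Splat a static background point cloud into a per-frame condition image.

    points: (N, 3) world coordinates; colors: (N, 3) in [0, 1]
    K: (3, 3) intrinsics; R, t: world-to-camera rotation/translation.
    A real system would densify and anti-alias; this keeps the nearest point.
    """
    cam = points @ R.T + t                    # world -> camera frame
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]
    uv = cam @ K.T                            # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]
    h, w = hw
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    img = np.zeros((h, w, 3))
    depth = np.full((h, w), np.inf)
    for i in np.flatnonzero(valid):           # naive per-pixel z-buffer
        if cam[i, 2] < depth[v[i], u[i]]:
            depth[v[i], u[i]] = cam[i, 2]
            img[v[i], u[i]] = colors[i]
    return img  # fed to the video diffusion model as a conditioning frame
```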
- Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians [7.051077403685518]
Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation.
arXiv Detail & Related papers (2026-01-02T13:04:47Z)
- Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation [49.12018869332346]
InfCam is a camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components; the first is infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model (a minimal sketch of the underlying geometry follows below).
arXiv Detail & Related papers (2025-12-18T20:03:05Z)
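For reference, the infinite homography is the standard multi-view-geometry map H_inf = K R K^{-1}, which relates pixels across two views that share intrinsics K and differ by a pure rotation R (it is the homography induced by the plane at infinity). A minimal NumPy sketch of that construct, independent of InfCam's actual latent-space implementation:

```python
import numpy as np

def infinite_homography(K, R):
    """H_inf = K @ R @ inv(K) maps pixels from view 1 to view 2 when the
    camera purely rotates by R (equivalently, for points at infinity)."""
    return K @ R @ np.linalg.inv(K)

def warp_pixels(H, uv):
    """Apply a 3x3 homography to (N, 2) pixel coordinates."""
    ones = np.ones((uv.shape[0], 1))
    homo = np.concatenate([uv, ones], axis=1) @ H.T
    return homo[:, :2] / homo[:, 2:3]

# Example: 10-degree yaw, focal length 500, principal point (320, 240).
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
a = np.deg2rad(10.0)
R = np.array([[np.cos(a), 0., np.sin(a)],
              [0., 1., 0.],
              [-np.sin(a), 0., np.cos(a)]])
print(warp_pixels(infinite_homography(K, R), np.array([[320., 240.]])))
```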
- SEE4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting [83.5106058182799]
We introduce SEE4D, a pose-free, trajectory-to-camera framework for 4D world modeling from casual videos. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized images. We validate SEE4D on cross-view video generation and sparse reconstruction benchmarks.
arXiv Detail & Related papers (2025-10-30T17:59:39Z)
- CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control [39.20528937415251]
We propose a method for generating fly-through videos of a scene from a single image and a given camera trajectory. We condition the UNet denoiser of an image-to-video diffusion model on the camera trajectory, using four techniques. We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-the-art results.
arXiv Detail & Related papers (2025-01-10T14:37:32Z)
- AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers [66.29824750770389]
We analyze camera motion from a first principles perspective, uncovering insights that enable precise 3D camera manipulation. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture.
arXiv Detail & Related papers (2024-11-27T18:49:13Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We show how to tame video diffusion transformers for 3D camera control using a ControlNet-like conditioning mechanism. Our work is the first to enable camera control for transformer-based video diffusion models (a minimal sketch of the zero-initialized conditioning pattern appears below).
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
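"ControlNet-like conditioning" conventionally means a side branch whose output projection is zero-initialized, so at the start of fine-tuning the pretrained model is reproduced exactly. A hedged PyTorch sketch of that general pattern; the camera embedding (e.g. Plücker ray coordinates) and all names are illustrative, not VD3D's code.

```python
import torch
import torch.nn as nn

class ZeroInitCameraControl(nn.Module):
    """ControlNet-style branch: camera embedding -> residual on hidden states.
    The final linear layer starts at zero, so training begins from the
    unmodified pretrained video transformer."""
    def __init__(self, cam_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cam_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        nn.init.zeros_(self.mlp[-1].weight)   # zero init: identity at step 0
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, hidden, cam_embed):
        # hidden: (B, N, hidden_dim) video tokens; cam_embed: (B, N, cam_dim)
        return hidden + self.mlp(cam_embed)
```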
- Controlling Space and Time with Diffusion Models [34.7002868116714]
We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS). We enable training on a mixture of 3D (with camera pose), 4D (pose+time), and video (time but no pose) data (a sketch of such mixed conditioning appears after this entry). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control.
arXiv Detail & Related papers (2024-07-10T17:23:33Z)
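One common way to train on the 3D/4D/video mixture that 4DiM describes is to substitute learned null embeddings for whichever conditioning signal a sample lacks, so a single model can consume heterogeneous data. A minimal sketch under that assumption (all names invented):

```python
import torch

def masked_condition(pose_emb, time_emb, has_pose, has_time,
                     null_pose, null_time):
    """Swap in learned null embeddings when a sample lacks pose or time
    labels, letting one model train on mixed 3D / 4D / video batches."""
    # pose_emb, time_emb: (B, D); has_pose, has_time: (B,) bool masks
    # null_pose, null_time: (D,) learned parameters
    pose = torch.where(has_pose[:, None], pose_emb, null_pose[None, :])
    time = torch.where(has_time[:, None], time_emb, null_time[None, :])
    return pose + time  # combined conditioning for the diffusion model
```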