CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
- URL: http://arxiv.org/abs/2602.06959v1
- Date: Fri, 06 Feb 2026 18:59:24 GMT
- Title: CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
- Authors: Kaiyi Huang, Yukun Huang, Yu Li, Jianhong Bai, Xintao Wang, Zinan Lin, Xuefei Ning, Jiwen Yu, Pengfei Wan, Yu Wang, Xihui Liu
- Abstract summary: We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation.
- Score: 65.03946626081036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cinematic video production requires control over scene-subject composition and camera movement, but live-action shooting remains costly due to the need to construct physical sets. To address this, we introduce the task of cinematic video generation with decoupled scene context: given multiple images of a static environment, the goal is to synthesize high-quality videos featuring dynamic subjects while preserving the underlying scene consistency and following a user-specified camera trajectory. We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. Our key innovation is a novel context conditioning mechanism that injects 3D-aware features in an implicit way: by encoding scene images into visual representations through VGGT, CineScene injects spatial priors into a pretrained text-to-video generation model via additional context concatenation, enabling camera-controlled video synthesis with consistent scenes and dynamic subjects. To further enhance the model's robustness, we introduce a simple yet effective random-shuffling strategy for the input scene images during training. To address the lack of training data, we construct a scene-decoupled dataset with Unreal Engine 5, containing paired videos of scenes with and without dynamic subjects, panoramic images representing the underlying static scene, and the corresponding camera trajectories. Experiments show that CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation, handling large camera movements and generalizing across diverse environments.
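The conditioning scheme described in the abstract (encode the scene images, randomly shuffle them during training, then concatenate their features with the video tokens as extra context) can be sketched roughly as follows. This is an illustrative assumption of the pipeline's shape, not CineScene's actual implementation; the function name, token representation, and shapes are all hypothetical:

```python
import random

def build_context(scene_features, video_tokens, rng, shuffle=True):
    """Illustrative sketch (not the paper's code) of context concatenation.

    scene_features: list of per-image token lists, e.g. produced by a
        VGGT-style encoder (each token standing in for a feature vector).
    video_tokens: token list for the video branch of the generator.

    During training, the scene images are shuffled at the image level so
    the model cannot rely on a fixed input ordering -- mirroring, in
    spirit, the paper's random-shuffling robustness strategy.
    """
    images = list(scene_features)
    if shuffle:
        rng.shuffle(images)  # shuffle whole images, not individual tokens
    # Flatten the (possibly reordered) per-image tokens into one context.
    context = [tok for image in images for tok in image]
    # Context concatenation: scene tokens are prepended to video tokens.
    return context + video_tokens

rng = random.Random(0)
scene = [[(i, t) for t in range(4)] for i in range(3)]  # 3 images x 4 tokens
video = [("vid", t) for t in range(10)]
seq = build_context(scene, video, rng)
# 3*4 scene tokens followed by the 10 video tokens: 22 tokens total
```

In a real model the tokens would be feature tensors and the concatenated sequence would be fed to the pretrained text-to-video backbone; the sketch only shows the ordering and augmentation logic.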
Related papers
- 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation [55.29423122177883]
3DScenePrompt is a framework that generates the next video chunk from arbitrary-length input while enabling camera control and preserving scene consistency. It significantly outperforms existing methods in scene consistency, camera controllability, and generation quality.
arXiv Detail & Related papers (2025-10-16T17:55:25Z)
- Video Perception Models for 3D Scene Synthesis [109.5543506037003]
VIPScene is a novel framework that exploits the encoded commonsense knowledge of the 3D physical world in video generation models. VIPScene seamlessly integrates video generation, feedforward 3D reconstruction, and open-vocabulary perception models to semantically and geometrically analyze each object in a scene.
arXiv Detail & Related papers (2025-06-25T16:40:17Z)
- DynVFX: Augmenting Real Videos with Dynamic Content [26.022834306983906]
We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects. The position, appearance, and motion of the new content are seamlessly integrated into the original footage.
arXiv Detail & Related papers (2025-02-05T21:14:55Z) - Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis [43.02778060969546]
We propose a controllable monocular dynamic view synthesis pipeline.
Our model does not require depth as input, and does not explicitly model 3D scene geometry.
We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.
arXiv Detail & Related papers (2024-05-23T17:59:52Z) - DynIBaR: Neural Dynamic Image-Based Rendering [79.44655794967741]
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene.
We adopt a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views.
We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets.
arXiv Detail & Related papers (2022-11-20T20:57:02Z) - NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
- Video Autoencoder: self-supervised disentanglement of static 3D structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.