3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
- URL: http://arxiv.org/abs/2510.14945v1
- Date: Thu, 16 Oct 2025 17:55:25 GMT
- Title: 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
- Authors: JoungBin Lee, Jaewoo Jung, Jisang Han, Takuya Narihira, Kazumi Fukuda, Junyoung Seo, Sunghwan Hong, Yuki Mitsufuji, Seungryong Kim
- Abstract summary: 3DScenePrompt is a framework that generates the next video chunk from arbitrary-length input. It enables precise camera control while preserving scene consistency. The framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality.
- Score: 55.29423122177883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present 3DScenePrompt, a framework that generates the next video chunk from arbitrary-length input while enabling precise camera control and preserving scene consistency. Unlike methods conditioned on a single image or a short clip, we employ dual spatio-temporal conditioning that reformulates context-view referencing across the input video. Our approach conditions on both temporally adjacent frames for motion continuity and spatially adjacent content for scene consistency. However, when generating beyond temporal boundaries, directly using spatially adjacent frames would incorrectly preserve dynamic elements from the past. We address this by introducing a 3D scene memory that represents exclusively the static geometry extracted from the entire input video. To construct this memory, we leverage dynamic SLAM with our newly introduced dynamic masking strategy that explicitly separates static scene geometry from moving elements. The static scene representation can then be projected to any target viewpoint, providing geometrically consistent warped views that serve as strong 3D spatial prompts while allowing dynamic regions to evolve naturally from temporal context. This enables our model to maintain long-range spatial coherence and precise camera control without sacrificing computational efficiency or motion realism. Extensive experiments demonstrate that our framework significantly outperforms existing methods in scene consistency, camera controllability, and generation quality. Project page: https://cvlab-kaist.github.io/3DScenePrompt/
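To make the memory mechanism concrete, below is a minimal sketch of the static 3D scene memory idea: pixels flagged as static by a dynamic mask are back-projected into a world-space point cloud, which is then splatted into an arbitrary target camera to produce a warped view usable as a 3D spatial prompt. The function names (`backproject_static`, `warp_to_view`) and the use of per-frame depths and poses in place of the paper's dynamic SLAM pipeline are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def backproject_static(depth, mask, K, cam_to_world):
    """Lift pixels flagged static (mask == True) into world space."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    keep = mask & (depth > 0)                      # drop dynamic / invalid pixels
    pix = np.stack([u[keep], v[keep], np.ones(keep.sum())])   # (3, N) homogeneous
    pts_cam = np.linalg.inv(K) @ pix * depth[keep]            # rays scaled by depth
    pts_h = np.vstack([pts_cam, np.ones((1, keep.sum()))])
    return (cam_to_world @ pts_h)[:3].T                       # (N, 3) world points

def warp_to_view(points, colors, K, world_to_cam, hw):
    """Z-buffered projection of the static point cloud into a target camera."""
    h, w = hw
    pts_cam = (world_to_cam @ np.hstack([points, np.ones((len(points), 1))]).T)[:3]
    front = pts_cam[2] > 1e-6                      # keep points in front of camera
    uvz = K @ pts_cam[:, front]
    u = np.round(uvz[0] / uvz[2]).astype(int)
    v = np.round(uvz[1] / uvz[2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    prompt = np.zeros((h, w, 3))
    zbuf = np.full((h, w), np.inf)
    for ui, vi, zi, ci in zip(u[ok], v[ok], uvz[2][ok], colors[front][ok]):
        if zi < zbuf[vi, ui]:                      # nearest point wins
            zbuf[vi, ui] = zi
            prompt[vi, ui] = ci
    return prompt                                  # warped view = 3D spatial prompt
```

In the paper's setting, such warped views would be supplied alongside temporally adjacent frames as the dual spatio-temporal condition, with dynamic regions left to evolve from the temporal context.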
Related papers
- CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation [65.03946626081036]
We present CineScene, a framework that leverages implicit 3D-aware scene representation for cinematic video generation. CineScene achieves state-of-the-art performance in scene-consistent cinematic video generation.
arXiv Detail & Related papers (2026-02-06T18:59:24Z) - RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming [79.81527946524098]
RoamScene3D is a novel framework that bridges the gap between semantic guidance and spatial generation. We employ a vision-language model (VLM) to construct a scene graph that encodes object relations. To mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset.
arXiv Detail & Related papers (2026-01-27T10:10:55Z) - Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians [7.051077403685518]
Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation.
arXiv Detail & Related papers (2026-01-02T13:04:47Z) - WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion [78.20778143251171]
WorldWarp is a framework that couples a 3D structural anchor with a 2D generative refiner. WorldWarp maintains consistency across video chunks by dynamically updating the 3D cache at every step. It achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture.
arXiv Detail & Related papers (2025-12-22T18:53:50Z) - Spatia: Video Generation with Updatable Spatial Memory [60.21619361473996]
Spatia is a spatial memory-aware video generation framework that preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
arXiv Detail & Related papers (2025-12-17T18:59:59Z) - VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning [38.89828994130979]
We introduce the task of arbitrary spatiotemporal video completion, where a video is generated from arbitrary, user-specified patches placed at any location and timestamp, akin to painting on a video canvas. This flexible formulation unifies many existing controllable video generation tasks--including first-frame image-to-video, inpainting, and extension--under a single, cohesive paradigm. We develop VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. A rough sketch of the idea follows.
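As a rough illustration of what zero-parameter in-context conditioning can look like, the sketch below concatenates clean condition-patch tokens with the noisy video tokens along the sequence axis so a frozen backbone can attend to them; the tokenization and `backbone` are placeholders, not the paper's model.

```python
import torch

def icc_forward(backbone, noisy_tokens: torch.Tensor, patch_tokens: torch.Tensor):
    """In-context conditioning: prepend clean condition tokens to noisy latents.

    noisy_tokens: (B, N, D) noisy video latents; patch_tokens: (B, M, D) clean
    tokens from user-placed patches. No new parameters are introduced."""
    seq = torch.cat([patch_tokens, noisy_tokens], dim=1)  # joint sequence
    out = backbone(seq)                                   # frozen pretrained model
    return out[:, patch_tokens.shape[1]:]                 # denoise only video part
```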
arXiv Detail & Related papers (2025-10-09T17:58:59Z) - Voyaging into Perpetual Dynamic Scenes from a Single View [31.85867311855001]
A key challenge is to ensure that different generated views are consistent with the underlying 3D motions. We propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting problem with new dynamic content. Experiments show that our model can generate perpetual scenes with consistent motions along fly-through cameras.
arXiv Detail & Related papers (2025-07-05T22:49:25Z) - GaVS: 3D-Grounded Video Stabilization via Temporally-Consistent Local Reconstruction and Rendering [54.489285024494855]
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain in which they operate, suffer from several issues that degrade the user experience. We introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally-consistent 'local reconstruction and rendering' paradigm.
arXiv Detail & Related papers (2025-06-30T15:24:27Z) - VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory [55.73900731190389]
We introduce Surfel-Indexed View Memory (VMem), a memory module that remembers past views by indexing them geometrically based on the 3D surface elements (surfels) they have observed. VMem enables efficient retrieval of the most relevant past views when generating new ones.
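A simplified sketch of the surfel-indexing idea, under loud assumptions: surfels are approximated by quantized 3D grid cells, each cell records which past views observed it, and retrieval for a new viewpoint ranks views by shared cells. The class and method names are hypothetical, not the paper's implementation.

```python
from collections import Counter, defaultdict
import numpy as np

class SurfelViewMemory:
    """Toy surfel-indexed view memory: index past views by observed 3D cells."""

    def __init__(self, cell_size: float = 0.25):
        self.cell_size = cell_size
        self.index: defaultdict = defaultdict(set)  # cell -> set of view ids

    def _cells(self, points: np.ndarray):
        # Quantize 3D points into grid cells standing in for surfels.
        return map(tuple, np.floor(points / self.cell_size).astype(int).tolist())

    def add_view(self, view_id: int, observed_points: np.ndarray) -> None:
        for cell in self._cells(observed_points):
            self.index[cell].add(view_id)

    def retrieve(self, query_points: np.ndarray, k: int = 4) -> list:
        # Vote: past views sharing the most observed cells with the query win.
        votes: Counter = Counter()
        for cell in self._cells(query_points):
            for view_id in self.index.get(cell, ()):
                votes[view_id] += 1
        return [view for view, _ in votes.most_common(k)]
```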
arXiv Detail & Related papers (2025-06-23T17:59:56Z) - Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model [14.775908473190684]
Scene Splatter is a momentum-based paradigm for video diffusion to generate generic scenes from a single image. We construct noisy samples from original features as momentum to enhance video details and maintain scene consistency. Our cascaded momentum enables video diffusion models to generate both high-fidelity and consistent novel views.
arXiv Detail & Related papers (2025-04-03T17:00:44Z) - Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene
Video from A Single Image [8.13564646389987]
We propose a novel approach to synthesize a consistent long-term video given a single scene image and a trajectory of large camera motions.
Our method outperforms state-of-the-art view synthesis approaches by a large margin.
arXiv Detail & Related papers (2022-03-17T17:16:16Z)