StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
- URL: http://arxiv.org/abs/2512.09363v2
- Date: Thu, 11 Dec 2025 15:59:50 GMT
- Title: StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
- Authors: Ke Xing, Xiaojie Jin, Longfei Li, Yuyang Yin, Hanwen Liang, Guixun Luo, Chen Fang, Jue Wang, Konstantinos N. Plataniotis, Yao Zhao, Yunchao Wei
- Abstract summary: StereoWorld is an end-to-end framework for high-fidelity monocular-to-stereo video generation. Our framework conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset.
- Score: 108.97993219426509
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency. The project webpage is available at https://ke-xing.github.io/StereoWorld/.
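The abstract names spatio-temporal tiling as the mechanism behind efficient high-resolution synthesis but does not detail it. As a minimal sketch of the general idea (not the paper's implementation), the snippet below splits frames into overlapping spatial tiles, runs a per-tile generator, and feather-blends the overlaps; the tile size, overlap, and `synth_fn` stub are hypothetical, and temporal chunks could be handled the same way.

```python
import numpy as np

def feather_weights(h, w, overlap):
    """2D blend weights that ramp linearly toward zero at tile borders."""
    wy = np.minimum(np.arange(h) + 1, np.arange(h)[::-1] + 1)
    wx = np.minimum(np.arange(w) + 1, np.arange(w)[::-1] + 1)
    wy = np.clip(wy / max(overlap, 1), 0.0, 1.0)
    wx = np.clip(wx / max(overlap, 1), 0.0, 1.0)
    return wy[:, None] * wx[None, :]

def tiled_synthesis(video, synth_fn, tile=256, overlap=32):
    """Apply synth_fn to overlapping spatial tiles of a (T, H, W, C) float
    video and feather-blend the per-tile outputs back together."""
    T, H, W, C = video.shape
    out = np.zeros_like(video)
    acc = np.zeros((H, W, 1), dtype=video.dtype)
    step = tile - overlap
    for y0 in range(0, max(H - overlap, 1), step):
        for x0 in range(0, max(W - overlap, 1), step):
            y1, x1 = min(y0 + tile, H), min(x0 + tile, W)
            patch = synth_fn(video[:, y0:y1, x0:x1])  # per-tile generation
            w = feather_weights(y1 - y0, x1 - x0, overlap)[..., None]
            out[:, y0:y1, x0:x1] += patch * w
            acc[y0:y1, x0:x1] += w
    return out / np.maximum(acc, 1e-8)  # normalize overlapping contributions
```

Normalizing by the accumulated weights keeps tile seams from appearing as brightness steps in the blended result.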
Related papers
- WorldStereo: Bridging Camera-Guided Video Generation and Scene Reconstruction via 3D Geometric Memories [36.79437857022868]
WorldStereo is a novel framework that bridges camera-guided video generation and 3D reconstruction. We show that WorldStereo acts as a powerful world model, tackling diverse scene generation tasks with high-fidelity 3D results.
arXiv Detail & Related papers (2026-03-02T16:36:56Z)
- StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors [41.34827274890319]
We introduce UniStereo, the first large-scale unified dataset for stereo video conversion. We propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps. Experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency.
arXiv Detail & Related papers (2025-12-18T18:59:50Z)
- Endless World: Real-Time 3D-Aware Long Video Generation [57.411689597435334]
Endless World is a real-time framework for infinite, 3D-consistent video generation. We introduce a conditional autoregressive training strategy that aligns newly generated content with existing video frames. Our 3D injection mechanism enforces physical plausibility and geometric consistency throughout extended sequences.
arXiv Detail & Related papers (2025-12-13T19:06:12Z)
- S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix [60.060882467801484]
We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope.
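As a rough illustration of the warping step summarized above (not S^2VG's actual code), the sketch below forward-warps one frame to a horizontally offset viewpoint using per-pixel depth; the pinhole `focal` and `baseline` values are placeholders, and the returned hole mask marks the disoccluded pixels that an inpainting stage would fill.

```python
import numpy as np

def warp_to_offset_view(frame, depth, baseline=0.063, focal=1000.0):
    """Forward-warp an (H, W, C) frame using an (H, W) depth map.
    Returns the warped frame and a mask of holes left by disocclusion."""
    H, W, _ = frame.shape
    disparity = focal * baseline / np.maximum(depth, 1e-6)  # shift in pixels
    warped = np.zeros_like(frame)
    zbuf = np.full((H, W), np.inf)  # keep the nearest surface per target pixel
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    xt = np.round(xs - disparity).astype(int)  # content shifts left in the new view
    valid = (xt >= 0) & (xt < W)
    # Explicit per-pixel splat with a z-test; slow but easy to follow.
    for y, x, x2 in zip(ys[valid], xs[valid], xt[valid]):
        if depth[y, x] < zbuf[y, x2]:
            zbuf[y, x2] = depth[y, x]
            warped[y, x2] = frame[y, x]
    holes = ~np.isfinite(zbuf)  # True where no source pixel landed
    return warped, holes
```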
arXiv Detail & Related papers (2025-08-11T14:50:03Z)
- Restereo: Diffusion stereo video generation and restoration [43.208256051997616]
We introduce a new pipeline that not only generates stereo videos but also consistently enhances both the left-view and right-view videos with a single model. Our method can be fine-tuned on a relatively small synthetic stereo video dataset and applied to low-quality real-world videos.
arXiv Detail & Related papers (2025-06-06T12:14:24Z)
- SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting [20.98704347305053]
We introduce SpatialMe, a novel stereo video conversion framework based on depth-warping and blend-inpainting. We construct a high-quality real-world stereo video dataset, StereoV1K, to alleviate the data shortage.
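A minimal sketch of the blend-inpainting half of that recipe, assuming a depth-warped view and an inpainted candidate are already in hand (the box-blur feathering and `soften` parameter are illustrative, not SpatialMe's actual design):

```python
import numpy as np

def blend_inpaint(warped, inpainted, holes, soften=3):
    """Composite a warped view with inpainted content inside the holes,
    softening the mask so the seam between the two sources is gradual."""
    mask = holes.astype(float)
    for _ in range(soften):  # cheap box blur; np.roll wraps at image borders
        mask = (mask
                + np.roll(mask, 1, 0) + np.roll(mask, -1, 0)
                + np.roll(mask, 1, 1) + np.roll(mask, -1, 1)) / 5.0
    mask = np.clip(mask, 0.0, 1.0)[..., None]
    return warped * (1.0 - mask) + inpainted * mask
```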
arXiv Detail & Related papers (2024-12-16T07:42:49Z)
- T-SVG: Text-Driven Stereoscopic Video Generation [87.62286959918566]
This paper introduces the Text-driven Stereoscopic Video Generation (T-SVG) system. It streamlines video generation by using text prompts to create reference videos. These videos are transformed into 3D point cloud sequences, which are rendered from two perspectives with subtle parallax differences.
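A toy version of that two-perspective rendering, assuming a pinhole camera (`focal` and `ipd` are made-up values, and z-buffering is omitted for brevity): unproject each pixel to a 3D point using its depth, then project the points from two eye positions separated by a small baseline.

```python
import numpy as np

def render_two_views(frame, depth, focal=1000.0, ipd=0.063):
    """Unproject an (H, W, C) frame via its (H, W) depth map, then splat
    the resulting points from a left and a right eye position."""
    H, W, _ = frame.shape
    cy, cx = H / 2.0, W / 2.0
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    Z = np.maximum(depth, 1e-6)
    X = (xs - cx) * Z / focal  # back-project to camera space
    Y = (ys - cy) * Z / focal
    views = []
    for eye_x in (-ipd / 2.0, ipd / 2.0):  # left eye, right eye
        u = np.round(focal * (X - eye_x) / Z + cx).astype(int)
        v = np.round(focal * Y / Z + cy).astype(int)
        img = np.zeros_like(frame)
        ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        img[v[ok], u[ok]] = frame[ys[ok], xs[ok]]  # nearest-point splat
        views.append(img)
    return views  # pixels stay black where no point projected
```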
arXiv Detail & Related papers (2024-12-12T14:48:46Z)
- StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart [44.671043951223574]
We introduce StereoCrafter-Zero, a novel framework for zero-shot stereo video generation. Key innovations include a noisy restart strategy to initialize stereo-aware latent representations. We show that StereoCrafter-Zero produces high-quality stereo videos with enhanced depth consistency and temporal smoothness.
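The summary does not detail the noisy restart; one common way to realize such a strategy is to push an initialized latent partway up the DDPM forward process and resume denoising from there, so structure already present in the latent survives. A sketch under that assumption (the schedule and step choice are illustrative):

```python
import numpy as np

def noisy_restart(latent, t, alphas_cumprod, rng=None):
    """Re-noise a latent to intermediate diffusion step t via the DDPM
    forward marginal x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    if rng is None:
        rng = np.random.default_rng()
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(latent.shape)
    return np.sqrt(a_bar) * latent + np.sqrt(1.0 - a_bar) * eps
```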
arXiv Detail & Related papers (2024-11-21T16:41:55Z)
- Stereo Anything: Unifying Zero-shot Stereo Matching with Large-Scale Mixed Data [77.27700893908012]
Stereo matching serves as a cornerstone in 3D vision, aiming to establish pixel-wise correspondences between stereo image pairs for depth recovery. Current models often exhibit severe performance degradation when deployed in unseen domains. We introduce StereoAnything, a data-centric framework that substantially enhances the zero-shot generalization capability of existing stereo models.
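For reference, the depth recovery mentioned above follows the standard rectified-stereo relation Z = f * B / d; a one-line helper (units assumed: focal length in pixels, baseline in meters):

```python
import numpy as np

def depth_from_disparity(disparity, focal, baseline):
    """Convert matched pixel disparities to metric depth: Z = f * B / d."""
    return focal * baseline / np.maximum(disparity, 1e-6)  # guard d -> 0
```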
arXiv Detail & Related papers (2024-11-21T11:59:04Z)
- SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input [6.275971782566314]
We introduce a novel self-supervised stereo video synthesis paradigm via a video diffusion model, termed SpatialDreamer. To address the insufficiency of stereo video data, we propose a Depth-based Video Generation (DVG) module. We also propose RefinerNet along with a self-supervised synthetic framework designed to facilitate efficient and dedicated training.
arXiv Detail & Related papers (2024-11-18T15:12:59Z)
- StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos [44.51044100125421]
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences.
Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays.
arXiv Detail & Related papers (2024-09-11T17:52:07Z)
- DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos.
The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions.
We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
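A crude, training-free stand-in for the neighboring-frame pooling described above is a sliding temporal average over the per-frame disparity maps; the `radius` below is illustrative, and the actual model learns this pooling rather than averaging uniformly.

```python
import numpy as np

def temporal_pool(disparities, radius=2):
    """Average each frame's (H, W) disparity with its +/- radius neighbors
    in a (T, H, W) float stack, damping frame-to-frame flicker."""
    T = disparities.shape[0]
    out = np.empty_like(disparities)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        out[t] = disparities[lo:hi].mean(axis=0)
    return out
```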
arXiv Detail & Related papers (2023-05-03T17:40:49Z)