SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting
- URL: http://arxiv.org/abs/2412.11512v1
- Date: Mon, 16 Dec 2024 07:42:49 GMT
- Title: SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting
- Authors: Jiale Zhang, Qianxi Jia, Yang Liu, Wei Zhang, Wei Wei, Xin Tian
- Abstract summary: We introduce SpatialMe, a novel stereo video conversion framework based on depth-warping and blend-inpainting.
We construct a high-quality real-world stereo video dataset -- StereoV1K -- to alleviate the data shortage.
- Score: 20.98704347305053
- Abstract: Stereo video conversion aims to transform monocular videos into an immersive stereo format. Despite advances in novel view synthesis, two major challenges remain: i) the difficulty of achieving high-fidelity and stable results, and ii) the shortage of high-quality stereo video data. In this paper, we introduce SpatialMe, a novel stereo video conversion framework based on depth-warping and blend-inpainting. Specifically, we propose a mask-based hierarchy feature update (MHFU) refiner, which integrates and refines the outputs of a multi-branch inpainting module using a feature update unit (FUU) and a mask mechanism. We also propose a disparity expansion strategy to address the problem of foreground bleeding. Furthermore, we construct a high-quality real-world stereo video dataset -- StereoV1K -- to alleviate the data shortage. It contains 1000 stereo videos captured in real-world indoor and outdoor scenes at a resolution of 1180 x 1180. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods in generating stereo videos.
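To make the depth-warping stage concrete, below is a minimal sketch of forward-warping a left view into a right view using a per-pixel disparity map, recording disoccluded pixels in a hole mask for a subsequent inpainting stage. This is an illustrative reconstruction under our own assumptions (NumPy, a hypothetical `forward_warp` helper, disparity measured in pixels), not the paper's actual code.

```python
import numpy as np

def forward_warp(left: np.ndarray, disparity: np.ndarray):
    """Forward-warp a left view to the right view via per-pixel disparity.

    left:      (H, W, 3) float image.
    disparity: (H, W) horizontal shift in pixels (larger = closer).
    Returns the warped right view and a hole mask marking disoccluded
    pixels that a blend-inpainting stage would have to fill.
    """
    h, w, _ = left.shape
    right = np.zeros_like(left)
    # Z-buffer on disparity: when two source pixels land on the same
    # target pixel, the nearer surface (larger disparity) wins.
    zbuf = np.full((h, w), -np.inf)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.round(xs - disparity).astype(int)  # target column in right view
    valid = (xt >= 0) & (xt < w)
    for y, x, tx in zip(ys[valid], xs[valid], xt[valid]):
        if disparity[y, x] > zbuf[y, tx]:
            zbuf[y, tx] = disparity[y, x]
            right[y, tx] = left[y, x]
    hole_mask = ~np.isfinite(zbuf)  # nothing mapped here: disocclusion
    return right, hole_mask
```

In SpatialMe's pipeline, such hole regions would be filled by the multi-branch inpainting module, whose outputs the MHFU refiner integrates and refines.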
Related papers
- T-SVG: Text-Driven Stereoscopic Video Generation [87.62286959918566]
This paper introduces the Text-driven Stereoscopic Video Generation (T-SVG) system.
It streamlines video generation by using text prompts to create reference videos.
These videos are transformed into 3D point cloud sequences, which are rendered from two perspectives with subtle parallax differences (see the sketch after this entry).
arXiv Detail & Related papers (2024-12-12T14:48:46Z)
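For intuition on the T-SVG rendering step above, here is a minimal sketch of projecting a colored point cloud into two pinhole cameras offset along x, which yields a stereo pair with small parallax. The `render_stereo` helper and the camera parameters are illustrative assumptions, not T-SVG's actual renderer.

```python
import numpy as np

def render_stereo(points, colors, fx, cx, cy, baseline, h, w):
    """Project a colored point cloud (N, 3) into two pinhole cameras
    separated by `baseline` along x; the offset produces the parallax."""
    views = []
    for shift in (-baseline / 2, baseline / 2):
        img = np.zeros((h, w, 3))
        zbuf = np.full((h, w), np.inf)
        x = points[:, 0] - shift               # move the camera along x
        y, z = points[:, 1], points[:, 2]
        front = z > 1e-6                       # keep points in front of camera
        u = np.round(fx * x[front] / z[front] + cx).astype(int)
        v = np.round(fx * y[front] / z[front] + cy).astype(int)
        ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        for ui, vi, zi, c in zip(u[ok], v[ok], z[front][ok], colors[front][ok]):
            if zi < zbuf[vi, ui]:              # z-buffer: nearest point wins
                zbuf[vi, ui] = zi
                img[vi, ui] = c
        views.append(img)
    return views                               # [left, right]
```

The baseline controls the strength of the parallax: a small value gives the "subtle" differences the summary describes, while a large one exaggerates depth at the cost of more disocclusion holes.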
- StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart [45.27524689977587]
We introduce *StereoCrafter-Zero*, a novel framework for zero-shot stereo video generation.
Key innovations include a noisy restart strategy to initialize stereo-aware latents and an iterative refinement process (sketched after this entry).
Our framework is robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation.
arXiv Detail & Related papers (2024-11-21T16:41:55Z)
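A minimal sketch of the "noisy restart" idea from StereoCrafter-Zero above: after each refinement pass, a decaying amount of fresh noise is re-injected before denoising again. The `denoise` callable stands in for a full diffusion sampler, and the schedule and scale here are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def noisy_restart(latents, denoise, restarts=3, noise_scale=0.5, seed=0):
    """Iterative refinement with noisy restarts: denoise, re-inject a
    shrinking amount of Gaussian noise, then denoise again."""
    rng = np.random.default_rng(seed)
    z = latents
    for k in range(restarts):
        z = denoise(z)                                 # one refinement pass
        sigma = noise_scale * (1.0 - k / restarts)     # anneal restart noise
        z = z + sigma * rng.standard_normal(z.shape)   # the 'restart'
    return denoise(z)                                  # final clean pass

# Toy usage with a stand-in denoiser that shrinks values toward zero.
z0 = np.random.default_rng(1).standard_normal((4, 4))
z = noisy_restart(z0, denoise=lambda z: 0.5 * z)
```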
- ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning [43.105154507379076]
*ImmersePro* is a framework specifically designed to transform single-view videos into stereo videos.
*ImmersePro* employs implicit disparity guidance, enabling the generation of stereo pairs from video sequences without the need for explicit disparity maps.
Our experiments demonstrate the effectiveness of *ImmersePro* in producing high-quality stereo videos, offering significant improvements over existing methods.
arXiv Detail & Related papers (2024-09-30T22:19:32Z)
- StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos [44.51044100125421]
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences.
Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays.
arXiv Detail & Related papers (2024-09-11T17:52:07Z)
- Stereo Matching in Time: 100+ FPS Video Stereo Matching for Extended Reality [65.70936336240554]
Real-time Stereo Matching is a cornerstone algorithm for many Extended Reality (XR) applications, such as indoor 3D understanding, video pass-through, and mixed-reality games.
One of the major difficulties is the lack of high-quality indoor video stereo training datasets captured by head-mounted VR/AR glasses.
We introduce a novel video stereo synthetic dataset that comprises renderings of various indoor scenes and realistic camera motion captured by a 6-DoF moving VR/AR head-mounted display (HMD).
This facilitates the evaluation of existing approaches and promotes further research on indoor augmented reality scenarios.
arXiv Detail & Related papers (2023-09-08T07:53:58Z)
- DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos.
The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions (see the sketch after this entry).
We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z)
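As a rough illustration of the cross-frame pooling that DynamicStereo (above) learns, here is a naive temporal smoothing of per-frame disparity maps. The real model pools with learned attention rather than a fixed average, so `temporal_pool` is purely an intuition-building stand-in.

```python
import numpy as np

def temporal_pool(disps, radius=2):
    """Average each frame's disparity map with its temporal neighbors.

    disps: list of (H, W) disparity maps for consecutive frames.
    A fixed-window mean is a crude stand-in for learned pooling.
    """
    t = len(disps)
    pooled = []
    for i in range(t):
        lo, hi = max(0, i - radius), min(t, i + radius + 1)
        pooled.append(np.mean(disps[lo:hi], axis=0))
    return pooled

# Toy usage: five noisy frames of a constant-disparity scene.
rng = np.random.default_rng(0)
frames = [1.0 + 0.1 * rng.standard_normal((8, 8)) for _ in range(5)]
smoothed = temporal_pool(frames)  # flicker is reduced by the windowed mean
```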
- Deep 3D Mask Volume for View Synthesis of Dynamic Scenes [49.45028543279115]
We introduce a multi-view video dataset captured with a custom 10-camera rig at 120 FPS.
The dataset contains 96 high-quality scenes showing various visual effects and human interactions in outdoor scenes.
We develop a new algorithm, Deep 3D Mask Volume, which enables temporally-stable view extrapolation from binocular videos of dynamic scenes, captured by static cameras.
arXiv Detail & Related papers (2021-08-30T17:55:28Z)