ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning
- URL: http://arxiv.org/abs/2410.00262v1
- Date: Mon, 30 Sep 2024 22:19:32 GMT
- Title: ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning
- Authors: Jian Shi, Zhenyu Li, Peter Wonka,
- Abstract summary: ImmersePro is a framework specifically designed to transform single-view videos into stereo videos.
ImmersePro employs implicit disparity guidance, enabling the generation of stereo pairs from video sequences without the need for explicit disparity maps.
Our experiments demonstrate the effectiveness of ImmersePro in producing high-quality stereo videos, offering significant improvements over existing methods.
- Score: 43.105154507379076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce \textit{ImmersePro}, an innovative framework specifically designed to transform single-view videos into stereo videos. This framework utilizes a novel dual-branch architecture comprising a disparity branch and a context branch on video data by leveraging spatial-temporal attention mechanisms. \textit{ImmersePro} employs implicit disparity guidance, enabling the generation of stereo pairs from video sequences without the need for explicit disparity maps, thus reducing potential errors associated with disparity estimation models. In addition to the technical advancements, we introduce the YouTube-SBS dataset, a comprehensive collection of 423 stereo videos sourced from YouTube. This dataset is unprecedented in its scale, featuring over 7 million stereo pairs, and is designed to facilitate training and benchmarking of stereo video generation models. Our experiments demonstrate the effectiveness of \textit{ImmersePro} in producing high-quality stereo videos, offering significant improvements over existing methods. Compared to the best competitor, stereo-from-mono, we quantitatively improve the results by 11.76\% (L1), 6.39\% (SSIM), and 5.10\% (PSNR).
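The implicit disparity guidance described in the abstract can be illustrated with a short PyTorch sketch: a network predicts a disparity map from the left view, the left view is warped horizontally with that map to synthesize the right view, and only the ground-truth right view supervises the result, so no explicit disparity labels are needed. This is a minimal sketch under assumed conventions (rectified views, a hypothetical `disparity_net`, photometric L1 supervision), not the authors' dual-branch implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(left: torch.Tensor, disparity: torch.Tensor) -> torch.Tensor:
    """Warp a left view (B, C, H, W) horizontally by a per-pixel disparity (B, 1, H, W), in pixels."""
    b, _, h, w = left.shape
    # Base sampling grid in normalized [-1, 1] coordinates (align_corners=True convention).
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h, device=left.device),
        torch.linspace(-1.0, 1.0, w, device=left.device),
        indexing="ij",
    )
    # Shift x-coordinates by the normalized disparity; y stays fixed for rectified pairs.
    # The sign of the shift is an assumption and depends on the dataset's disparity convention.
    x_shifted = xs.unsqueeze(0) - 2.0 * disparity.squeeze(1) / max(w - 1, 1)
    grid = torch.stack((x_shifted, ys.unsqueeze(0).expand(b, -1, -1)), dim=-1)
    return F.grid_sample(left, grid, mode="bilinear", padding_mode="border", align_corners=True)

def training_step(disparity_net, left, right_gt):
    """Hypothetical training step: the disparity map is never compared to a label (implicit guidance)."""
    disparity = disparity_net(left)                    # (B, 1, H, W), predicted from the left view only
    right_pred = warp_with_disparity(left, disparity)
    return F.l1_loss(right_pred, right_gt)             # supervised by the stereo pair itself
```

The paper's context branch, spatial-temporal attention, and any occlusion handling are omitted here; the sketch only shows how warping with a predicted disparity lets the stereo pair itself act as supervision.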
Related papers
- Generative Video Semantic Communication via Multimodal Semantic Fusion with Large Model [55.71885688565501]
We propose a scalable generative video semantic communication framework that extracts and transmits semantic information to achieve high-quality video reconstruction.
Specifically, at the transmitter, description and other condition signals are extracted from the source video, functioning as text and structural semantics, respectively.
At the receiver, diffusion-based GenAI large models are utilized to fuse the semantics of the multiple modalities and reconstruct the video.
arXiv Detail & Related papers (2025-02-19T15:59:07Z)
- SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting [20.98704347305053]
We introduce SpatialMe, a novel stereo video conversion framework based on depth-warping and blend-inpainting.
We construct a high-quality real-world stereo video dataset, StereoV1K, to alleviate the data shortage (see the depth-warping sketch after this entry).
arXiv Detail & Related papers (2024-12-16T07:42:49Z)
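The depth-warping step named in the SpatialMe entry above can be sketched generically: depth is converted to a horizontal disparity, left-view pixels are forward-warped into the target view, and the unfilled (disoccluded) pixels are recorded in a hole mask for a separate inpainting stage. This is an assumed, minimal illustration (rectified views, disparity = baseline * focal / depth, an assumed shift direction, no depth-ordered overlap resolution), not the paper's actual pipeline.

```python
import numpy as np

def depth_warp_with_holes(left: np.ndarray, depth: np.ndarray, baseline: float, focal: float):
    """left: (H, W, 3) image; depth: (H, W), same units as `baseline`; returns warped view + hole mask."""
    h, w, _ = left.shape
    disparity = baseline * focal / np.maximum(depth, 1e-6)      # horizontal shift in pixels
    right = np.zeros_like(left)
    hole_mask = np.ones((h, w), dtype=bool)                     # True where no source pixel landed
    xs = np.arange(w)
    for y in range(h):
        x_target = np.round(xs - disparity[y]).astype(int)      # shift direction is an assumption
        valid = (x_target >= 0) & (x_target < w)
        right[y, x_target[valid]] = left[y, xs[valid]]          # overlaps: last write wins (no z-buffer)
        hole_mask[y, x_target[valid]] = False
    # The hole mask marks the disoccluded regions that an inpainting stage would fill.
    return right, hole_mask
```

A production pipeline would additionally resolve overlapping writes by keeping the nearest pixel and blend the inpainted content back into the warped view; the sketch only separates the warping and hole-marking steps.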
- Video Set Distillation: Information Diversification and Temporal Densification [68.85010825225528]
Video sets have two dimensions of redundancy: within-sample and inter-sample redundancy.
We are the first to study Video Set Distillation, which synthesizes optimized video data by addressing within-sample and inter-sample redundancies.
arXiv Detail & Related papers (2024-11-28T05:37:54Z)
- StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart [45.27524689977587]
We introduce StereoCrafter-Zero, a novel framework for zero-shot stereo video generation.
Key innovations include a noisy restart strategy to initialize stereo-aware latents and an iterative refinement process.
Our framework is robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation.
arXiv Detail & Related papers (2024-11-21T16:41:55Z)
- Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data [26.029499450825092]
We introduce StereoAnything, a solution for robust stereo matching.
We scale up the dataset by collecting labeled stereo images and generating synthetic stereo pairs from unlabeled monocular images.
We extensively evaluate the zero-shot capabilities of our model on five public datasets.
arXiv Detail & Related papers (2024-11-21T11:59:04Z)
- Match Stereo Videos via Bidirectional Alignment [15.876953256378224]
Recent learning-based methods often focus on optimizing performance for independent stereo pairs, leading to temporal inconsistencies in videos.
We introduce a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods.
We present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation.
arXiv Detail & Related papers (2024-09-30T13:37:29Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts use synthetic annotations to refine large-scale, diverse ASR datasets that suffer from low quality.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos.
The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions.
We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z)
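The temporal pooling mentioned in the DynamicStereo entry above, where per-frame predictions borrow information from neighboring frames, can be sketched with a generic temporal attention block. This is an assumed illustration of the general idea, not the paper's architecture: it attends across the frames of a short clip at every spatial location, and a real system would restrict this to downsampled features or a local window to keep memory manageable.

```python
import torch
import torch.nn as nn

class TemporalPooling(nn.Module):
    """Attend across the T frames of a clip at each spatial location (channels must be divisible by num_heads)."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) per-frame features for a short clip.
        b, t, c, h, w = feats.shape
        # Treat every spatial location as an independent sequence of T frame tokens.
        tokens = feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        pooled, _ = self.attn(tokens, tokens, tokens)
        # Restore the (B, T, C, H, W) layout; each frame's features now mix in neighboring frames.
        return pooled.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```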
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.