ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning
- URL: http://arxiv.org/abs/2410.00262v1
- Date: Mon, 30 Sep 2024 22:19:32 GMT
- Title: ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning
- Authors: Jian Shi, Zhenyu Li, Peter Wonka
- Abstract summary: \textit{ImmersePro} is a framework specifically designed to transform single-view videos into stereo videos.
\textit{ImmersePro} employs implicit disparity guidance, enabling the generation of stereo pairs from video sequences without the need for explicit disparity maps.
Our experiments demonstrate the effectiveness of \textit{ImmersePro} in producing high-quality stereo videos, offering significant improvements over existing methods.
- Score: 43.105154507379076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce \textit{ImmersePro}, an innovative framework specifically designed to transform single-view videos into stereo videos. This framework utilizes a novel dual-branch architecture, comprising a disparity branch and a context branch, that operates on video data by leveraging spatial-temporal attention mechanisms. \textit{ImmersePro} employs implicit disparity guidance, enabling the generation of stereo pairs from video sequences without the need for explicit disparity maps, thus reducing potential errors associated with disparity estimation models. In addition to the technical advancements, we introduce the YouTube-SBS dataset, a comprehensive collection of 423 stereo videos sourced from YouTube. This dataset is unprecedented in its scale, featuring over 7 million stereo pairs, and is designed to facilitate training and benchmarking of stereo video generation models. Our experiments demonstrate the effectiveness of \textit{ImmersePro} in producing high-quality stereo videos, offering significant improvements over existing methods. Compared to the best competitor, stereo-from-mono, we quantitatively improve the results by 11.76\% (L1), 6.39\% (SSIM), and 5.10\% (PSNR).
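The abstract describes the architecture only at a high level. As a rough, hedged illustration of what "implicit disparity guidance" can look like in practice, the PyTorch sketch below is our own minimal assumption, not the authors' implementation: a disparity branch predicts a per-pixel horizontal offset from the left view, the left view is warped with it, and a context branch refines the warped result, so the disparity head is supervised only through the right-view reconstruction loss rather than against explicit disparity maps. Module names, channel counts, and the `max_disp` parameter are invented for the example; the actual method additionally applies spatial-temporal attention over video clips, which is omitted here for brevity.

```python
# Minimal sketch (not the authors' code): right-view synthesis with an
# *implicit* disparity map, trained only via photometric reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp_with_disparity(left, disparity):
    """Backward-warp the left image: right(y, x) samples left(y, x + d)."""
    b, _, h, w = left.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=left.device, dtype=left.dtype),
        torch.arange(w, device=left.device, dtype=left.dtype),
        indexing="ij",
    )
    xs = xs.unsqueeze(0) + disparity.squeeze(1)          # shift source columns
    grid_x = 2.0 * xs / (w - 1) - 1.0                    # normalize to [-1, 1]
    grid_y = 2.0 * ys.unsqueeze(0) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y.expand_as(grid_x)), dim=-1)
    return F.grid_sample(left, grid, align_corners=True, padding_mode="border")


class DualBranchStereoSynth(nn.Module):
    """Toy dual-branch model: a disparity branch predicts implicit disparity,
    a context branch predicts a residual refinement of the warped left view."""

    def __init__(self, ch=32, max_disp=64.0):
        super().__init__()
        self.max_disp = max_disp
        self.disparity_branch = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid(),
        )
        self.context_branch = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, left):
        disp = self.disparity_branch(left) * self.max_disp  # never supervised directly
        warped = warp_with_disparity(left, disp)
        refined = warped + self.context_branch(torch.cat([left, warped], dim=1))
        return refined, disp


if __name__ == "__main__":
    model = DualBranchStereoSynth()
    left = torch.rand(2, 3, 64, 128)        # stand-in left frames
    right_gt = torch.rand(2, 3, 64, 128)    # ground-truth right views
    right_pred, disp = model(left)
    loss = F.l1_loss(right_pred, right_gt)  # photometric supervision only
    loss.backward()
    print(right_pred.shape, disp.shape, float(loss))
```

Because the warp and refinement are fully differentiable, gradients from the L1 (and, in the paper, SSIM/PSNR-style) reconstruction objectives reach the disparity branch without any ground-truth disparity labels, which is the core idea the abstract calls implicit disparity guidance.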
Related papers
- StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart [45.27524689977587]
We introduce \textit{StereoCrafter-Zero}, a novel framework for zero-shot stereo video generation.
Key innovations include a noisy restart strategy to initialize stereo-aware latents and an iterative refinement process.
Our framework is robust and adaptable across various diffusion models, setting a new benchmark for zero-shot stereo video generation.
arXiv Detail & Related papers (2024-11-21T16:41:55Z)
- Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data [26.029499450825092]
We introduce StereoAnything, a solution for robust stereo matching.
We scale up the dataset by collecting labeled stereo images and generating synthetic stereo pairs from unlabeled monocular images.
We extensively evaluate the zero-shot capabilities of our model on five public datasets.
arXiv Detail & Related papers (2024-11-21T11:59:04Z)
- SpatialDreamer: Self-supervised Stereo Video Synthesis from Monocular Input [6.275971782566314]
We introduce a novel self-supervised stereo video synthesis paradigm based on a video diffusion model, termed SpatialDreamer.
To address the insufficiency of stereo video data, we propose a Depth-based Video Generation (DVG) module.
We also propose RefinerNet, together with a self-supervised synthetic framework, designed to facilitate efficient and dedicated training.
arXiv Detail & Related papers (2024-11-18T15:12:59Z)
- Match Stereo Videos via Bidirectional Alignment [15.876953256378224]
Recent learning-based methods often focus on optimizing performance for independent stereo pairs, leading to temporal inconsistencies in videos.
We introduce a novel video processing framework, BiDAStereo, and a plugin stabilizer network, BiDAStabilizer, compatible with general image-based methods.
We present a realistic synthetic dataset and benchmark focused on natural scenes, along with a real-world dataset captured by a stereo camera in diverse urban scenes for qualitative evaluation.
arXiv Detail & Related papers (2024-09-30T13:37:29Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation serves as the core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders the technique transfer from academia to industry.
In this work, we aim at filling the gap, with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z)
- DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos.
The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions.
We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- Support-Set Based Cross-Supervision for Video Grounding [98.29089558426399]
The Support-set Based Cross-Supervision (Sscs) module can improve existing methods during the training phase without extra inference cost.
The proposed Sscs module contains two main components, i.e., a discriminative contrastive objective and a generative caption objective.
We extensively evaluate Sscs on three challenging datasets, and show that our method can improve current state-of-the-art methods by large margins.
arXiv Detail & Related papers (2021-08-24T08:25:26Z)