Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards
- URL: http://arxiv.org/abs/2506.05367v2
- Date: Wed, 23 Jul 2025 14:25:59 GMT
- Title: Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards
- Authors: Aakash Garg, Libing Zeng, Andrii Tsarov, Nima Khademi Kalantari
- Abstract summary: We propose a novel diffusion-based approach to generate stereo images given a text prompt. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images.
- Score: 5.029575650441432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a novel diffusion-based approach to generate stereo images given a text prompt. Since stereo image datasets with large baselines are scarce, training a diffusion model from scratch is not feasible. Therefore, we propose leveraging the strong priors learned by Stable Diffusion and fine-tuning it on stereo image datasets to adapt it to the task of stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.
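The abstract names a stereo consistency reward but does not spell it out. As a hedged sketch only, the Python below shows one plausible form of such a reward: warp the generated right view into the left view with a disparity map and score photometric agreement. The function names and the reliance on an external disparity estimate are illustrative assumptions, not the paper's definition; the prompt alignment reward (e.g., a CLIP-style image-text score) would be analogous and is omitted here.

```python
# Minimal sketch of a stereo consistency reward (assumed form, not the
# paper's exact reward). Requires a disparity map for the left view,
# e.g., from an off-the-shelf stereo or monocular depth estimator.
import torch
import torch.nn.functional as F

def warp_right_to_left(right: torch.Tensor, disparity: torch.Tensor) -> torch.Tensor:
    """Backward-warp the right image into the left view via horizontal disparity.

    right:     (B, C, H, W) image batch in [0, 1]
    disparity: (B, 1, H, W) left-view disparity in pixels
    """
    b, _, h, w = right.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=right.device, dtype=right.dtype),
        torch.arange(w, device=right.device, dtype=right.dtype),
        indexing="ij",
    )
    # A left-view pixel at x corresponds to x - d in the right image.
    x_src = xs.unsqueeze(0) - disparity.squeeze(1)
    y_src = ys.unsqueeze(0).expand_as(x_src)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    grid = torch.stack(
        (2.0 * x_src / (w - 1) - 1.0, 2.0 * y_src / (h - 1) - 1.0),
        dim=-1,
    )
    return F.grid_sample(right, grid, align_corners=True, padding_mode="border")

def stereo_consistency_reward(left, right, disparity):
    """Higher when the warped right view photometrically matches the left view."""
    recon = warp_right_to_left(right, disparity)
    return -F.l1_loss(recon, left)  # negate the error so larger is better
```

A reward of this family penalizes left/right pairs whose content drifts apart under the expected disparity, which is the failure mode a text-to-stereo model fine-tuned from a monocular prior would most plausibly exhibit.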
Related papers
- Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model [62.37493746544967]
Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps. Existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation.
arXiv Detail & Related papers (2025-03-30T16:24:22Z) - Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion [88.67015254278859]
We introduce the Mono2Stereo dataset, providing high-quality training data and a benchmark to support in-depth exploration of stereo conversion. We conduct an empirical study yielding two primary findings; notably, the differences between the left and right views are subtle, yet existing metrics consider all pixels and fail to concentrate on the regions critical to stereo effects. We therefore introduce a new evaluation metric, Stereo Intersection-over-Union, which harmonizes disparity and achieves a high correlation with human judgments of the stereo effect (see the toy sketch after this list).
arXiv Detail & Related papers (2025-03-28T09:25:58Z) - GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching [9.322869042942504]
GenStereo is a diffusion-based approach to stereo image generation. It achieves both the visual quality needed for viewing and the geometric accuracy needed for matching. Our framework eliminates the need for complex hardware setups while enabling high-quality stereo image generation.
arXiv Detail & Related papers (2025-03-17T01:19:28Z) - ZeroStereo: Zero-shot Stereo Matching from Single Images [17.560148513475387]
We propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images by leveraging pseudo disparities generated by a monocular depth estimation model. Our pipeline achieves state-of-the-art zero-shot generalization across multiple datasets using only a dataset volume comparable to Scene Flow.
arXiv Detail & Related papers (2025-01-15T08:43:48Z) - Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data [26.029499450825092]
We introduce StereoAnything, a solution for robust stereo matching. We scale up the dataset by collecting labeled stereo images and generating synthetic stereo pairs from unlabeled monocular images. We extensively evaluate the zero-shot capabilities of our model on five public datasets.
arXiv Detail & Related papers (2024-11-21T11:59:04Z) - MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling [18.02254687807291]
Although Transformer-based stereo models have been studied recently, their performance still lags behind that of CNN-based models due to the inherent data scarcity of the stereo matching task.
We propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, which enhances the locality inductive bias by leveraging Masked Image Modeling (MIM) when training Transformer-based stereo models.
arXiv Detail & Related papers (2024-09-04T16:17:45Z) - StereoDiffusion: Training-Free Stereo Image Generation Using Latent Diffusion Models [2.9260206957981167]
We introduce StereoDiffusion, a training-free method that is remarkably straightforward to use and integrates seamlessly into the original Stable Diffusion model.
Our method modifies the latent variable to provide an end-to-end, lightweight capability for fast generation of stereo image pairs.
Our proposed method maintains a high standard of image quality throughout the stereo generation process, achieving state-of-the-art scores in various quantitative evaluations.
arXiv Detail & Related papers (2024-03-08T00:30:25Z) - DynamicStereo: Consistent Dynamic Depth from Stereo Videos [91.1804971397608]
We propose DynamicStereo to estimate disparity for stereo videos.
The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions.
We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments.
arXiv Detail & Related papers (2023-05-03T17:40:49Z) - Single-View View Synthesis with Self-Rectified Pseudo-Stereo [49.946151180828465]
Leveraging a reliable and explicit stereo prior, we generate a pseudo-stereo viewpoint.
We propose a self-rectified stereo synthesis to amend erroneous regions in an identify-rectify manner.
Our method outperforms state-of-the-art single-view view synthesis methods and stereo synthesis methods.
arXiv Detail & Related papers (2023-04-19T09:36:13Z) - SMD-Nets: Stereo Mixture Density Networks [68.56947049719936]
We propose Stereo Mixture Density Networks (SMD-Nets), a simple yet effective learning framework compatible with a wide class of 2D and 3D architectures.
Specifically, we exploit bimodal mixture densities as output representation and show that this allows for sharp and precise disparity estimates near discontinuities.
We carry out comprehensive experiments on a new high-resolution and highly realistic synthetic stereo dataset, consisting of stereo pairs at 8Mpx resolution, as well as on real-world stereo datasets.
arXiv Detail & Related papers (2021-04-08T16:15:46Z) - Expanding Sparse Guidance for Stereo Matching [24.74333370941674]
We propose a novel sparsity expansion technique that expands sparse cues over RGB images for local feature enhancement.
Our approach significantly boosts the existing state-of-the-art stereo algorithms with extremely sparse cues.
arXiv Detail & Related papers (2020-04-24T06:41:11Z) - AdaStereo: A Simple and Efficient Approach for Adaptive Stereo Matching [50.06646151004375]
We present AdaStereo, a novel domain-adaptive pipeline that aligns multi-level representations for deep stereo matching networks.
Our AdaStereo models achieve state-of-the-art cross-domain performance on multiple stereo benchmarks, including KITTI, Middlebury, ETH3D, and DrivingStereo.
arXiv Detail & Related papers (2020-04-09T16:15:13Z)