M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion
- URL: http://arxiv.org/abs/2505.16565v1
- Date: Thu, 22 May 2025 11:58:54 GMT
- Title: M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion
- Authors: Nina Shvetsova, Goutam Bhat, Prune Truong, Hilde Kuehne, Federico Tombari
- Abstract summary: We propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study.
- Score: 60.728003408015844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, obtaining an average rank of 1.43 among the 4 compared methods in a user study, while being 6x faster than the second-placed method.
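The abstract pins down two mechanisms: depth-based reprojection, which yields the warped right view plus the disocclusion masks used for conditioning, and SVD attention layers modified to compute full attention for disoccluded pixels. Below is a minimal sketch of the reprojection step only, assuming a nearest-neighbor forward splat and per-pixel disparity computed from focal length and baseline; the function `warp_left_to_right` and all parameter names are illustrative, not from the paper's code.

```python
# Minimal sketch of depth-based reprojection: forward-splat the left frame
# into the right view via disparity, and record which pixels were never
# written (the disocclusion mask the model must inpaint). Illustrative only;
# names and the nearest-neighbor splat are assumptions, not the paper's code.
import torch

def warp_left_to_right(left: torch.Tensor, depth: torch.Tensor,
                       focal: float, baseline: float):
    """left: (C, H, W) frame; depth: (H, W) metric depth.

    Returns the warped right view and a (1, H, W) disocclusion mask that is
    1 where no source pixel landed.
    """
    C, H, W = left.shape
    disparity = focal * baseline / depth.clamp(min=1e-6)  # horizontal shift, px
    xs = torch.arange(W).expand(H, W)
    ys = torch.arange(H).unsqueeze(1).expand(H, W)
    # The right camera sees content shifted left by the disparity.
    target_x = (xs - disparity).round().long()
    valid = (target_x >= 0) & (target_x < W)

    warped = torch.zeros_like(left)
    hit = torch.zeros(H, W, dtype=torch.bool)
    # Nearest-neighbor splat; collisions are last-write-wins in this sketch.
    warped[:, ys[valid], target_x[valid]] = left[:, ys[valid], xs[valid]]
    hit[ys[valid], target_x[valid]] = True
    return warped, (~hit).float().unsqueeze(0)

# Conditioning as described in the abstract: left video, warped right video,
# and disocclusion masks stacked along the channel dimension (per frame here).
left = torch.rand(3, 64, 64)
depth = torch.rand(64, 64) * 9 + 1.0          # toy depth in [1, 10]
right, mask = warp_left_to_right(left, depth, focal=50.0, baseline=0.1)
conditioning = torch.cat([left, right, mask], dim=0)  # (7, H, W)
```

A production warp would resolve splatting collisions by depth ordering rather than last-write-wins; the resulting mask is what the modified attention layers key on when borrowing content from neighboring frames.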
Related papers
- Restereo: Diffusion stereo video generation and restoration [43.208256051997616]
We introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our method can be fine-tuned on relatively small synthetic stereo video datasets and applied to low-quality real-world videos.
arXiv Detail & Related papers (2025-06-06T12:14:24Z)
- Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion [52.0192865857058]
We propose the first training-free 4D video generation method, leveraging off-the-shelf video diffusion models to generate multi-view videos from a single input video and offering a practical and effective solution for multi-view video generation.
arXiv Detail & Related papers (2025-03-28T17:14:48Z)
- TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models [33.219657261649324]
TrajectoryCrafter is a novel approach for redirecting camera trajectories in monocular videos. By disentangling deterministic view transformations from content generation, our method achieves precise control over user-specified camera trajectories.
arXiv Detail & Related papers (2025-03-07T17:57:53Z)
- Elevating Flow-Guided Video Inpainting with Reference Generation [50.03502211226332]
Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video. We propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm. Our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts.
arXiv Detail & Related papers (2024-12-12T06:13:00Z)
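The entry above pairs generated reference frames with classical flow-guided pixel propagation. As a rough illustration of the propagation half only, the sketch below backward-warps a reference frame into the current frame's holes along a given optical flow; `propagate` and its signature are hypothetical, and the flow is assumed to come from an off-the-shelf estimator.

```python
# Rough sketch of flow-guided pixel propagation: backward-warp a reference
# frame into the current frame's holes along a given optical flow. The flow
# is assumed to come from an off-the-shelf estimator; `propagate` and its
# signature are hypothetical, not this paper's API.
import torch
import torch.nn.functional as F

def propagate(curr: torch.Tensor, ref: torch.Tensor, flow: torch.Tensor,
              hole_mask: torch.Tensor) -> torch.Tensor:
    """curr, ref: (1, C, H, W); flow: (1, 2, H, W) curr->ref displacement
    in pixels; hole_mask: (1, 1, H, W), 1 where curr is missing."""
    _, _, H, W = curr.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Normalize sampling coordinates to [-1, 1] as grid_sample expects.
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (1, H, W, 2)
    warped = F.grid_sample(ref, grid, align_corners=True)
    # Keep known pixels, fill holes with flow-propagated content.
    return curr * (1 - hole_mask) + warped * hole_mask

curr, ref = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)                          # identity flow
mask = (torch.rand(1, 1, 64, 64) > 0.9).float()
filled = propagate(curr, ref, flow, mask)
```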
- VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models [58.464465016269614]
We propose a novel framework for solving high-definition video inverse problems using latent image diffusion models. Our approach delivers HD-resolution reconstructions in under 6 seconds per frame on a single NVIDIA 4090 GPU.
arXiv Detail & Related papers (2024-11-29T08:10:49Z)
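For context on the problem class the VISION-XL entry addresses, latent-diffusion inverse-problem solvers typically nudge the denoised latent toward consistency with the degraded observation through a known degradation operator. The following is a generic sketch of that pattern, not the paper's actual algorithm; `denoise`, `decode`, and `degrade` are hypothetical stand-ins.

```python
# Generic latent-diffusion data-consistency step, sketched only to illustrate
# the problem class; this is NOT VISION-XL's actual algorithm. `denoise`,
# `decode`, and `degrade` are hypothetical stand-ins for a diffusion denoiser,
# a VAE decoder, and the known degradation operator.
import torch
import torch.nn.functional as F

def guided_step(z_t, t, y, denoise, decode, degrade, step_size=0.5):
    """Nudge the clean-latent estimate toward consistency with observation y.
    A full sampler would re-noise the result to the next timestep (omitted)."""
    z_t = z_t.detach().requires_grad_(True)
    z0_hat = denoise(z_t, t)                  # predicted clean latent
    residual = degrade(decode(z0_hat)) - y    # data-consistency error
    grad, = torch.autograd.grad(residual.pow(2).sum(), z_t)
    return (z0_hat - step_size * grad).detach()

# Toy usage: identity denoiser/decoder, 2x downsampling as the degradation.
denoise = lambda z, t: z
decode = lambda z: z
degrade = lambda x: F.avg_pool2d(x, 2)
y = torch.rand(1, 4, 16, 16)                  # degraded observation
z = guided_step(torch.rand(1, 4, 32, 32), t=10, y=y,
                denoise=denoise, decode=decode, degrade=degrade)
```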
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation. We introduce a novel bidirectional sampling strategy to address off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
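A hedged sketch of the bidirectional idea summarized in the entry above: at each step the video latent is denoised conditioned on the start frame, then flipped along the time axis and denoised conditioned on the end frame, so both endpoints constrain the intermediate frames. The `denoise_step` interface is hypothetical, and a real sampler would interleave these passes inside a proper diffusion schedule.

```python
# Hedged sketch of bidirectional sampling: at each step, denoise conditioned
# on the start frame, then flip the latent along time and denoise conditioned
# on the end frame. `denoise_step` is a hypothetical interface; a real sampler
# interleaves these passes inside a proper diffusion schedule.
import torch

def bidirectional_sample(z, start, end, denoise_step, timesteps):
    """z: noisy video latent (T, C, H, W); start, end: (C, H, W) frames."""
    for t in timesteps:
        z = denoise_step(z, t, cond=start)         # forward pass
        z = denoise_step(z.flip(0), t, cond=end)   # backward pass, time reversed
        z = z.flip(0)                              # restore temporal order
    return z

# Toy usage with a stand-in denoiser that pulls the latent toward its condition.
denoise_step = lambda z, t, cond: 0.9 * z + 0.1 * cond.unsqueeze(0)
frames = torch.rand(8, 3, 32, 32)
out = bidirectional_sample(frames, frames[0], frames[-1],
                           denoise_step, timesteps=range(10, 0, -1))
```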
- Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution [151.1255837803585]
We propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo) for video super-resolution.
SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction.
Experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-03-25T17:59:26Z)