STeP: A General and Scalable Framework for Solving Video Inverse Problems with Spatiotemporal Diffusion Priors
- URL: http://arxiv.org/abs/2504.07549v1
- Date: Thu, 10 Apr 2025 08:24:26 GMT
- Title: STeP: A General and Scalable Framework for Solving Video Inverse Problems with Spatiotemporal Diffusion Priors
- Authors: Bingliang Zhang, Zihui Wu, Berthy T. Feng, Yang Song, Yisong Yue, Katherine L. Bouman
- Abstract summary: We study how to solve general inverse problems involving videos using diffusion model priors. We introduce a general and scalable framework for solving video inverse problems.
- Score: 27.45644471304381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how to solve general Bayesian inverse problems involving videos using diffusion model priors. While it is desirable to use a video diffusion prior to effectively capture complex temporal relationships, due to the computational and data requirements of training such a model, prior work has instead relied on image diffusion priors on single frames combined with heuristics to enforce temporal consistency. However, these approaches struggle with faithfully recovering the underlying temporal relationships, particularly for tasks with high temporal uncertainty. In this paper, we demonstrate the feasibility of practical and accessible spatiotemporal diffusion priors by fine-tuning latent video diffusion models from pretrained image diffusion models using limited videos in specific domains. Leveraging this plug-and-play spatiotemporal diffusion prior, we introduce a general and scalable framework for solving video inverse problems. We then apply our framework to two challenging scientific video inverse problems--black hole imaging and dynamic MRI. Our framework enables the generation of diverse, high-fidelity video reconstructions that not only fit observations but also recover multi-modal solutions. By incorporating a spatiotemporal diffusion prior, we significantly improve our ability to capture complex temporal relationships in the data while also enhancing spatial fidelity.
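For intuition, the following is a minimal sketch (in PyTorch) of how a plug-and-play spatiotemporal prior of this kind can drive posterior sampling: at each reverse-diffusion step, a Tweedie estimate of the clean video is nudged toward agreement with the measurements via a guidance gradient, in the style of diffusion posterior sampling. The toy denoiser, the masking forward operator, and all schedules and step sizes below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyVideoDenoiser(nn.Module):
    """Stand-in for a fine-tuned latent video diffusion model; predicts the
    noise added to a (B, T, C, H, W) latent video with a single 3D conv."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, z_t, t):
        # A real model would condition on the timestep t; this toy one ignores it.
        return self.net(z_t.transpose(1, 2)).transpose(1, 2)

def forward_op(x, mask):
    """Linear measurement operator A: per-entry masking (e.g. undersampling)."""
    return x * mask

def posterior_sample(model, y, mask, shape, T=100, guidance_scale=0.5):
    """DPS-style guided reverse diffusion with a video (spatiotemporal) prior."""
    betas = torch.linspace(1e-4, 2e-2, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)
    for t in reversed(range(T)):
        z = z.detach().requires_grad_(True)
        eps = model(z, t)
        ab = alpha_bars[t]
        z0_hat = (z - (1.0 - ab).sqrt() * eps) / ab.sqrt()   # Tweedie estimate of x0
        resid = (forward_op(z0_hat, mask) - y).pow(2).sum()  # data misfit
        grad = torch.autograd.grad(resid, z)[0]              # guidance gradient
        with torch.no_grad():
            mean = (z - betas[t] / (1.0 - ab).sqrt() * eps) / alphas[t].sqrt()
            z = mean - guidance_scale * grad                 # measurement-consistency nudge
            if t > 0:
                z = z + betas[t].sqrt() * torch.randn_like(z)
    return z.detach()

if __name__ == "__main__":
    shape = (1, 8, 4, 16, 16)                  # (batch, frames, channels, H, W)
    x_true = torch.randn(shape)
    mask = (torch.rand(shape) > 0.5).float()   # keep ~50% of entries
    y = forward_op(x_true, mask)               # observed measurements
    x_rec = posterior_sample(ToyVideoDenoiser(), y, mask, shape)
    print(x_rec.shape)                         # torch.Size([1, 8, 4, 16, 16])
```

In the paper's settings, the denoiser would be a latent video diffusion model fine-tuned from a pretrained image prior, and the forward operator would encode the task physics, e.g. interferometric measurements for black hole imaging or k-space undersampling for dynamic MRI.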
Related papers
- LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering [10.060267989615813]
We introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding. Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method's precise visual grounding substantially enhances the understanding of video-question relationships.
arXiv Detail & Related papers (2025-07-20T01:57:00Z) - MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows [21.969862773424314]
MOOSE is a novel temporally-centric video encoder that integrates optical flow with spatial embeddings to model temporal information efficiently. Unlike prior models, MOOSE takes advantage of rich, widely available pre-trained visual and optical-flow encoders instead of training video models from scratch.
arXiv Detail & Related papers (2025-06-01T18:53:27Z) - Temporal-Consistent Video Restoration with Pre-trained Diffusion Models [51.47188802535954]
Video restoration (VR) aims to recover high-quality videos from degraded ones. Recent zero-shot VR methods using pre-trained diffusion models (DMs) suffer from approximation errors during reverse diffusion and insufficient temporal consistency. We present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors.
arXiv Detail & Related papers (2025-03-19T03:41:56Z) - Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model.
It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss.
Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z) - DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations [25.756755602342942]
We present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes the learning burden of complex degradations through staged training. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead.
arXiv Detail & Related papers (2025-01-17T10:53:03Z) - DiffuEraser: A Diffusion Model for Video Inpainting [13.292164408616257]
We introduce DiffuEraser, a video inpainting model based on stable diffusion, to fill masked regions with greater details and more coherent structures.
We also expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models.
arXiv Detail & Related papers (2025-01-17T08:03:02Z) - DIVD: Deblurring with Improved Video Diffusion Model [8.816046910904488]
Diffusion models and video diffusion models have excelled in the fields of image and video generation. We introduce a video diffusion model specifically for the task of video deblurring. Our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics.
arXiv Detail & Related papers (2024-12-01T11:39:02Z) - VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models [58.464465016269614]
We propose a novel framework for solving high-definition video inverse problems using latent image diffusion models.
Our approach delivers HD-resolution reconstructions in under 6 seconds per frame on a single NVIDIA 4090 GPU.
arXiv Detail & Related papers (2024-11-29T08:10:49Z) - Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models [56.691967706131]
We view frames as continuous functions in 2D space, and videos as a sequence of continuous warping transformations between different frames.
This perspective allows us to train function space diffusion models only on images and utilize them to solve temporally correlated inverse problems.
Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems.
arXiv Detail & Related papers (2024-10-21T16:19:34Z) - A Survey on Diffusion Models for Inverse Problems [110.6628926886398]
We provide an overview of methods that utilize pre-trained diffusion models to solve inverse problems without requiring further training.
We discuss specific challenges and potential solutions associated with using latent diffusion models for inverse problems.
arXiv Detail & Related papers (2024-09-30T17:34:01Z) - Ensemble Kalman Diffusion Guidance: A Derivative-free Method for Inverse Problems [21.95946380639509]
In inverse problems, it is increasingly popular to use pre-trained diffusion models as plug-and-play priors.
Most existing methods rely on privileged information such as derivatives, pseudo-inverses, or full knowledge of the forward model.
We propose Ensemble Kalman Diffusion Guidance (EnKG) for diffusion models, a derivative-free approach that solves inverse problems by accessing only forward-model evaluations and a pre-trained diffusion model prior (a generic sketch of this style of update appears after this list).
arXiv Detail & Related papers (2024-09-30T10:36:41Z) - Solving Video Inverse Problems Using Image Diffusion Models [58.464465016269614]
We introduce an innovative video inverse solver that leverages only image diffusion models. Our method treats the time dimension of a video as the batch dimension of image diffusion models. We also introduce a batch-consistent sampling strategy that encourages consistency across batches.
arXiv Detail & Related papers (2024-09-04T09:48:27Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - Solving Inverse Problems with Latent Diffusion Models via Hard Data Consistency [7.671153315762146]
Training diffusion models in pixel space is both data-intensive and computationally demanding.
Latent diffusion models, which operate in a much lower-dimensional space, offer a solution to these challenges.
We propose ReSample, an algorithm that can solve general inverse problems with pre-trained latent diffusion models.
arXiv Detail & Related papers (2023-07-16T18:42:01Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and encourages randomly shuffled frames to produce low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions.
And the ISM simultaneously utilizes the spatial information from SIM and temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z) - An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate spatio-temporal information across frames and reduces the need for high-complexity networks.
arXiv Detail & Related papers (2020-12-24T00:03:29Z) - Intrinsic Temporal Regularization for High-resolution Human Video Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating 512×512-resolution human action videos.
arXiv Detail & Related papers (2020-12-11T05:29:45Z)
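As a concrete illustration of the derivative-free idea behind Ensemble Kalman Diffusion Guidance above, here is a generic ensemble-Kalman-style consistency update (in PyTorch) that touches the forward model only through evaluations. It is a simplified sketch of the general technique, not the paper's exact algorithm, and every name and constant in it is an assumption.

```python
import torch

def ensemble_kalman_update(ensemble, y, forward_op, obs_noise=1e-2):
    """One derivative-free consistency step: nudge each ensemble member toward
    the measurements y using only forward-model evaluations (no gradients)."""
    N = ensemble.shape[0]
    X = ensemble.reshape(N, -1)                                     # states (N, d)
    Y = torch.stack([forward_op(x).reshape(-1) for x in ensemble])  # outputs (N, m)
    dX, dY = X - X.mean(0), Y - Y.mean(0)
    C_xy = dX.T @ dY / (N - 1)                                      # cross-covariance (d, m)
    C_yy = dY.T @ dY / (N - 1)                                      # output covariance (m, m)
    K = C_xy @ torch.linalg.inv(C_yy + obs_noise * torch.eye(Y.shape[1]))
    return (X + (y.reshape(-1) - Y) @ K.T).reshape_as(ensemble)    # Kalman-style move

if __name__ == "__main__":
    mask = (torch.rand(4, 8, 8) > 0.5).float()
    A = lambda x: x * mask                    # black-box forward model
    y = A(torch.randn(4, 8, 8))               # observations of a 4-frame video
    ensemble = torch.randn(16, 4, 8, 8)       # e.g. candidates drawn from a diffusion prior
    ensemble = ensemble_kalman_update(ensemble, y, A)
    print(ensemble.shape)                     # torch.Size([16, 4, 8, 8])
```

Because the update uses only ensemble statistics of forward-model outputs, it applies even when the forward model is a black box with no gradients or pseudo-inverse available.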
This list is automatically generated from the titles and abstracts of the papers in this site.