STeP: A General and Scalable Framework for Solving Video Inverse Problems with Spatiotemporal Diffusion Priors
- URL: http://arxiv.org/abs/2504.07549v1
- Date: Thu, 10 Apr 2025 08:24:26 GMT
- Title: STeP: A General and Scalable Framework for Solving Video Inverse Problems with Spatiotemporal Diffusion Priors
- Authors: Bingliang Zhang, Zihui Wu, Berthy T. Feng, Yang Song, Yisong Yue, Katherine L. Bouman
- Abstract summary: We study how to solve general inverse problems involving videos using diffusion model priors. We introduce a general and scalable framework for solving video inverse problems.
- Score: 27.45644471304381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how to solve general Bayesian inverse problems involving videos using diffusion model priors. While it is desirable to use a video diffusion prior to effectively capture complex temporal relationships, due to the computational and data requirements of training such a model, prior work has instead relied on image diffusion priors on single frames combined with heuristics to enforce temporal consistency. However, these approaches struggle with faithfully recovering the underlying temporal relationships, particularly for tasks with high temporal uncertainty. In this paper, we demonstrate the feasibility of practical and accessible spatiotemporal diffusion priors by fine-tuning latent video diffusion models from pretrained image diffusion models using limited videos in specific domains. Leveraging this plug-and-play spatiotemporal diffusion prior, we introduce a general and scalable framework for solving video inverse problems. We then apply our framework to two challenging scientific video inverse problems--black hole imaging and dynamic MRI. Our framework enables the generation of diverse, high-fidelity video reconstructions that not only fit observations but also recover multi-modal solutions. By incorporating a spatiotemporal diffusion prior, we significantly improve our ability to capture complex temporal relationships in the data while also enhancing spatial fidelity.
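For intuition, the following is a minimal sketch (in PyTorch) of how a plug-and-play spatiotemporal prior of this kind can drive posterior sampling: at each reverse-diffusion step, a Tweedie estimate of the clean video is nudged toward agreement with the measurements via a guidance gradient, in the style of diffusion posterior sampling. The toy denoiser, the masking forward operator, and all schedules and step sizes below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyVideoDenoiser(nn.Module):
    """Stand-in for a fine-tuned latent video diffusion model; predicts the
    noise added to a (B, T, C, H, W) latent video with a single 3D conv."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, z_t, t):
        # A real model would condition on the timestep t; this toy one ignores it.
        return self.net(z_t.transpose(1, 2)).transpose(1, 2)

def forward_op(x, mask):
    """Linear measurement operator A: per-entry masking (e.g. undersampling)."""
    return x * mask

def posterior_sample(model, y, mask, shape, T=100, guidance_scale=0.5):
    """DPS-style guided reverse diffusion with a video (spatiotemporal) prior."""
    betas = torch.linspace(1e-4, 2e-2, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)
    for t in reversed(range(T)):
        z = z.detach().requires_grad_(True)
        eps = model(z, t)
        ab = alpha_bars[t]
        z0_hat = (z - (1.0 - ab).sqrt() * eps) / ab.sqrt()   # Tweedie estimate of x0
        resid = (forward_op(z0_hat, mask) - y).pow(2).sum()  # data misfit
        grad = torch.autograd.grad(resid, z)[0]              # guidance gradient
        with torch.no_grad():
            mean = (z - betas[t] / (1.0 - ab).sqrt() * eps) / alphas[t].sqrt()
            z = mean - guidance_scale * grad                 # measurement-consistency nudge
            if t > 0:
                z = z + betas[t].sqrt() * torch.randn_like(z)
    return z.detach()

if __name__ == "__main__":
    shape = (1, 8, 4, 16, 16)                  # (batch, frames, channels, H, W)
    x_true = torch.randn(shape)
    mask = (torch.rand(shape) > 0.5).float()   # keep ~50% of entries
    y = forward_op(x_true, mask)               # observed measurements
    x_rec = posterior_sample(ToyVideoDenoiser(), y, mask, shape)
    print(x_rec.shape)                         # torch.Size([1, 8, 4, 16, 16])
```

In the paper's settings, the denoiser would be a latent video diffusion model fine-tuned from a pretrained image prior, and the forward operator would encode the task physics, e.g. interferometric measurements for black hole imaging or k-space undersampling for dynamic MRI.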
Related papers
- LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering [10.060267989615813]
We introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding. Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method's precise visual grounding substantially enhances the understanding of video-question relationships.
arXiv Detail & Related papers (2025-07-20T01:57:00Z) - MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows [21.969862773424314]
MOOSE is a novel temporally-centric video encoder that integrates optical flow with spatial embeddings to model temporal information efficiently. Unlike prior models, MOOSE takes advantage of rich, widely available pre-trained visual and optical-flow encoders instead of training video models from scratch.
arXiv Detail & Related papers (2025-06-01T18:53:27Z) - Temporal-Consistent Video Restoration with Pre-trained Diffusion Models [51.47188802535954]
Video restoration (VR) aims to recover high-quality videos from degraded ones. Recent zero-shot VR methods using pre-trained diffusion models (DMs) suffer from approximation errors during reverse diffusion and insufficient temporal consistency. We present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors.
arXiv Detail & Related papers (2025-03-19T03:41:56Z) - Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
A new tokenizer, the Conditioned Diffusion-based Tokenizer (CDT), replaces the GAN-based decoder with a conditional diffusion model.
It is trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss.
Even a scaled-down version of CDT (3× inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z) - DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations [25.756755602342942]
We present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes the learning burden of complex degradations through staged training. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead.
arXiv Detail & Related papers (2025-01-17T10:53:03Z) - DiffuEraser: A Diffusion Model for Video Inpainting [13.292164408616257]
We introduce DiffuEraser, a video inpainting model based on stable diffusion, to fill masked regions with greater details and more coherent structures.
We also expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models.
arXiv Detail & Related papers (2025-01-17T08:03:02Z) - DIVD: Deblurring with Improved Video Diffusion Model [8.816046910904488]
Diffusion models and video diffusion models have excelled in the fields of image and video generation. We introduce a video diffusion model specifically for the task of video deblurring. Our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics.
arXiv Detail & Related papers (2024-12-01T11:39:02Z) - VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models [58.464465016269614]
We propose a novel framework for solving high-definition video inverse problems using latent image diffusion models.
Our approach delivers HD-resolution reconstructions in under 6 seconds per frame on a single NVIDIA 4090 GPU.
arXiv Detail & Related papers (2024-11-29T08:10:49Z) - Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models [56.691967706131]
We view frames as continuous functions in 2D space, and videos as a sequence of continuous warping transformations between different frames.
This perspective allows us to train function space diffusion models only on images and utilize them to solve temporally correlated inverse problems.
Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems.
arXiv Detail & Related papers (2024-10-21T16:19:34Z) - A Survey on Diffusion Models for Inverse Problems [110.6628926886398]
We provide an overview of methods that utilize pre-trained diffusion models to solve inverse problems without requiring further training.
We discuss specific challenges and potential solutions associated with using latent diffusion models for inverse problems.
arXiv Detail & Related papers (2024-09-30T17:34:01Z) - Ensemble Kalman Diffusion Guidance: A Derivative-free Method for Inverse Problems [21.95946380639509]
In inverse problems, it is increasingly popular to use pre-trained diffusion models as plug-and-play priors.
Most existing methods rely on privileged information such as derivatives, pseudo-inverses, or full knowledge of the forward model.
We propose Ensemble Kalman Diffusion Guidance (EnKG) for diffusion models, a derivative-free approach that solves inverse problems by accessing only forward-model evaluations and a pre-trained diffusion model prior (a generic sketch of this style of update appears after this list).
arXiv Detail & Related papers (2024-09-30T10:36:41Z) - Solving Video Inverse Problems Using Image Diffusion Models [58.464465016269614]
We introduce an innovative video inverse solver that leverages only image diffusion models. Our method treats the time dimension of a video as the batch dimension of image diffusion models. We also introduce a batch-consistent sampling strategy that encourages consistency across batches.
arXiv Detail & Related papers (2024-09-04T09:48:27Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - Solving Inverse Problems with Latent Diffusion Models via Hard Data Consistency [7.671153315762146]
Training diffusion models in pixel space is both data-intensive and computationally demanding.
Latent diffusion models, which operate in a much lower-dimensional space, offer a solution to these challenges.
We propose ReSample, an algorithm that can solve general inverse problems with pre-trained latent diffusion models.
arXiv Detail & Related papers (2023-07-16T18:42:01Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and encourages randomly shuffled frames to produce low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions.
And the ISM simultaneously utilizes the spatial information from SIM and temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z) - An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate spatio-temporal information across frames and reduces the need for high-complexity networks.
arXiv Detail & Related papers (2020-12-24T00:03:29Z) - Intrinsic Temporal Regularization for High-resolution Human Video Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating 512×512-resolution human action videos.
arXiv Detail & Related papers (2020-12-11T05:29:45Z)
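As a concrete illustration of the derivative-free idea behind Ensemble Kalman Diffusion Guidance above, here is a generic ensemble-Kalman-style consistency update (in PyTorch) that touches the forward model only through evaluations. It is a simplified sketch of the general technique, not the paper's exact algorithm, and every name and constant in it is an assumption.

```python
import torch

def ensemble_kalman_update(ensemble, y, forward_op, obs_noise=1e-2):
    """One derivative-free consistency step: nudge each ensemble member toward
    the measurements y using only forward-model evaluations (no gradients)."""
    N = ensemble.shape[0]
    X = ensemble.reshape(N, -1)                                     # states (N, d)
    Y = torch.stack([forward_op(x).reshape(-1) for x in ensemble])  # outputs (N, m)
    dX, dY = X - X.mean(0), Y - Y.mean(0)
    C_xy = dX.T @ dY / (N - 1)                                      # cross-covariance (d, m)
    C_yy = dY.T @ dY / (N - 1)                                      # output covariance (m, m)
    K = C_xy @ torch.linalg.inv(C_yy + obs_noise * torch.eye(Y.shape[1]))
    return (X + (y.reshape(-1) - Y) @ K.T).reshape_as(ensemble)    # Kalman-style move

if __name__ == "__main__":
    mask = (torch.rand(4, 8, 8) > 0.5).float()
    A = lambda x: x * mask                    # black-box forward model
    y = A(torch.randn(4, 8, 8))               # observations of a 4-frame video
    ensemble = torch.randn(16, 4, 8, 8)       # e.g. candidates drawn from a diffusion prior
    ensemble = ensemble_kalman_update(ensemble, y, A)
    print(ensemble.shape)                     # torch.Size([16, 4, 8, 8])
```

Because the update uses only ensemble statistics of forward-model outputs, it applies even when the forward model is a black box with no gradients or pseudo-inverse available.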
This list is automatically generated from the titles and abstracts of the papers in this site.