LDMVFI: Video Frame Interpolation with Latent Diffusion Models
- URL: http://arxiv.org/abs/2303.09508v3
- Date: Mon, 11 Dec 2023 15:17:20 GMT
- Title: LDMVFI: Video Frame Interpolation with Latent Diffusion Models
- Authors: Duolikun Danier, Fan Zhang, David Bull
- Abstract summary: We propose latent diffusion model-based VFI, LDMVFI.
It approaches the VFI problem from a generative perspective, formulating it as a conditional generation problem.
Our experiments and user study indicate that LDMVFI is able to interpolate video content with favorable perceptual quality compared to the state of the art, even in the high-resolution regime.
- Score: 3.884484241124158
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing works on video frame interpolation (VFI) mostly employ deep neural
networks that are trained by minimizing the L1, L2, or deep feature space
distance (e.g. VGG loss) between their outputs and ground-truth frames.
However, recent works have shown that these metrics are poor indicators of
perceptual VFI quality. Towards developing perceptually-oriented VFI methods,
in this work we propose latent diffusion model-based VFI, LDMVFI. This
approaches the VFI problem from a generative perspective by formulating it as a
conditional generation problem. As the first effort to address VFI using latent
diffusion models, we rigorously benchmark our method on common test sets used
in the existing VFI literature. Our quantitative experiments and user study
indicate that LDMVFI is able to interpolate video content with favorable
perceptual quality compared to the state of the art, even in the
high-resolution regime. Our code is available at
https://github.com/danier97/LDMVFI.
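To make the conditional-generation formulation concrete, below is a minimal, illustrative sketch, not the authors' implementation: a denoiser predicts the noise in the middle frame's latent while conditioned on the latents of the two neighbouring frames, and a decoder maps the sampled latent back to pixels. All module names, sizes, and the diffusion schedule (TinyEncoder, TinyDecoder, CondDenoiser, T_STEPS, etc.) are assumptions for illustration; the actual LDMVFI components are in the linked repository.

```python
import torch
import torch.nn as nn

LATENT_CH, T_STEPS = 4, 50  # assumed latent channel count / number of diffusion steps

class TinyEncoder(nn.Module):
    """Stand-in for the autoencoder's encoder (maps frames to latents)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, LATENT_CH, kernel_size=8, stride=8)  # 8x downsampling
    def forward(self, x):
        return self.net(x)

class TinyDecoder(nn.Module):
    """Stand-in for the matching decoder (maps latents back to frames)."""
    def __init__(self):
        super().__init__()
        self.net = nn.ConvTranspose2d(LATENT_CH, 3, kernel_size=8, stride=8)
    def forward(self, z):
        return self.net(z)

class CondDenoiser(nn.Module):
    """Predicts the noise in z_t, conditioned on the neighbouring frames' latents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * LATENT_CH + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, LATENT_CH, 3, padding=1),
        )
    def forward(self, z_t, z_prev, z_next, t):
        # Crude timestep conditioning: a constant channel holding t / T_STEPS.
        t_map = torch.full_like(z_t[:, :1], float(t) / T_STEPS)
        return self.net(torch.cat([z_t, z_prev, z_next, t_map], dim=1))

@torch.no_grad()
def interpolate(frame0, frame1, enc, dec, eps_model):
    """DDPM-style ancestral sampling of the middle frame, in latent space."""
    betas = torch.linspace(1e-4, 0.02, T_STEPS)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z0, z1 = enc(frame0), enc(frame1)     # condition: latents of the two input frames
    z = torch.randn_like(z0)              # start the middle frame's latent from noise
    for t in reversed(range(T_STEPS)):
        eps = eps_model(z, z0, z1, t)     # conditional noise prediction
        mean = (z - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        z = mean + betas[t].sqrt() * torch.randn_like(z) if t > 0 else mean
    return dec(z)                         # decode the sampled latent to pixels

# Shape-level usage (modules are untrained, so the output is not a real frame):
f0, f1 = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
mid = interpolate(f0, f1, TinyEncoder(), TinyDecoder(), CondDenoiser())
print(mid.shape)  # torch.Size([1, 3, 256, 256])
```

A real implementation would replace the stubs with a learned autoencoder and a conditional denoising U-Net, and train the denoiser with the standard noise-prediction objective; the sketch only shows how conditioning on the two neighbouring frames turns sampling into frame interpolation.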
Related papers
- Motion-aware Latent Diffusion Models for Video Frame Interpolation [51.78737270917301]
Motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity.
We propose a novel diffusion framework, motion-aware latent diffusion models (MADiff).
Our method achieves state-of-the-art performance, significantly outperforming existing approaches.
arXiv Detail & Related papers (2024-04-21T05:09:56Z)
- Diffusion-Based Particle-DETR for BEV Perception [94.88305708174796]
Bird's-Eye-View (BEV) is one of the most widely used scene representations for visual perception in Autonomous Vehicles (AVs).
Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively detect small objects in the large coverage of the BEV.
Here, we address this problem by combining the diffusion paradigm with current state-of-the-art 3D object detectors in BEV.
arXiv Detail & Related papers (2023-12-18T09:52:14Z)
- Flow-Guided Diffusion for Video Inpainting [15.478104117672803]
Video inpainting has been challenged by complex scenarios like large movements and low-light conditions.
Current methods, including emerging diffusion models, face limitations in quality and efficiency.
This paper introduces the Flow-Guided Diffusion model for Video Inpainting (FGDVI), a novel approach that significantly enhances temporal consistency and inpainting quality.
arXiv Detail & Related papers (2023-11-26T17:48:48Z)
- A Multi-In-Single-Out Network for Video Frame Interpolation without Optical Flow [14.877766449009119]
Deep learning-based video frame interpolation (VFI) methods have predominantly focused on estimating motion between two input frames.
We propose a multi-in-single-out (MISO) based VFI method that does not rely on motion vector estimation.
We introduce a novel motion perceptual loss that enables MISO-VFI to better capture the spatio-temporal correlations within the video frames.
arXiv Detail & Related papers (2023-11-20T08:29:55Z)
- Boost Video Frame Interpolation via Motion Adaptation [73.42573856943923]
Video frame interpolation (VFI) is a challenging task that aims to generate intermediate frames between two consecutive frames in a video.
Existing learning-based VFI methods have achieved great success, but they still suffer from limited generalization ability.
We propose a novel optimization-based VFI method that can adapt to unseen motions at test time.
arXiv Detail & Related papers (2023-06-24T10:44:02Z)
- Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE); see the sketch after this list.
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
- Exploring Vision Transformers as Diffusion Learners [15.32238726790633]
We systematically explore vision Transformers as diffusion learners for various generative tasks.
With our improvements, the performance of a vanilla ViT-based backbone (IU-ViT) is boosted to be on par with traditional U-Net-based methods.
We are the first to successfully train a single diffusion model on the text-to-image task beyond 64x64 resolution.
arXiv Detail & Related papers (2022-12-28T10:32:59Z)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
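For the DiffMAE entry above, here is a minimal, illustrative sketch (assumptions only, not the paper's implementation) of conditioning a diffusion model on masked input: noise is injected only into masked patches, so the clean visible patches act as the conditioning signal for the denoiser.

```python
import torch

def diffmae_style_input(patches, mask_ratio=0.75, noise_level=0.5):
    """patches: (N, D) patch embeddings; returns the noisy input and the mask."""
    n = patches.shape[0]
    mask = torch.rand(n) < mask_ratio            # True = masked (to be denoised)
    noisy = patches.clone()
    noise = torch.randn_like(patches)
    # Noise is injected only where mask is True; visible patches stay clean,
    # so they condition the denoiser on the observed parts of the image.
    noisy[mask] = (1 - noise_level) ** 0.5 * patches[mask] + noise_level ** 0.5 * noise[mask]
    return noisy, mask

x = torch.randn(196, 768)                        # e.g. 14x14 ViT patch embeddings
noisy, mask = diffmae_style_input(x)
print(noisy.shape, int(mask.sum()), "patches masked")
```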