Time-adaptive Video Frame Interpolation based on Residual Diffusion
- URL: http://arxiv.org/abs/2504.05402v2
- Date: Tue, 15 Apr 2025 18:25:08 GMT
- Title: Time-adaptive Video Frame Interpolation based on Residual Diffusion
- Authors: Victor Fonte Chavez, Claudia Esteves, Jean-Bernard Hayet
- Abstract summary: We propose a new diffusion-based method for video frame interpolation (VFI). We provide extensive comparisons with respect to state-of-the-art models and show that our model outperforms these models on animation videos.
- Score: 2.5261465733373965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we propose a new diffusion-based method for video frame interpolation (VFI), in the context of traditional hand-made animation. We introduce three main contributions: The first is that we explicitly handle the interpolation time in our model, which we also re-estimate during the training process, to cope with the particularly large variations observed in the animation domain, compared to natural videos; The second is that we adapt and generalize a diffusion scheme called ResShift recently proposed in the super-resolution community to VFI, which allows us to perform a very low number of diffusion steps (in the order of 10) to produce our estimates; The third is that we leverage the stochastic nature of the diffusion process to provide a pixel-wise estimate of the uncertainty on the interpolated frame, which could be useful to anticipate where the model may be wrong. We provide extensive comparisons with respect to state-of-the-art models and show that our model outperforms these models on animation videos. Our code is available at https://github.com/VicFonch/Multi-Input-Resshift-Diffusion-VFI.
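To make the second and third contributions concrete, the sketch below illustrates (i) a ResShift-style forward process adapted to VFI, in which the diffusion state gradually shifts the residual between the ground-truth intermediate frame and a condition built from the two input frames, and (ii) how a pixel-wise uncertainty map can be obtained by drawing several stochastic reverse samples and measuring their per-pixel spread. This is a minimal illustration only: the schedule, the `kappa` value, and the placeholder `reverse_sample` function are assumptions, not the released implementation.

```python
import torch


def eta_schedule(T: int, eta_min: float = 1e-3, eta_max: float = 0.999) -> torch.Tensor:
    """Hypothetical monotone shifting schedule: eta_0 ~ 0, eta_{T-1} ~ 1."""
    ratio = (eta_max / eta_min) ** (1.0 / (T - 1))
    return eta_min * ratio ** torch.arange(T, dtype=torch.float32)


def resshift_forward(x0: torch.Tensor, cond: torch.Tensor, t: int,
                     etas: torch.Tensor, kappa: float = 2.0) -> torch.Tensor:
    """ResShift-style forward step adapted to VFI (sketch).

    x0   : ground-truth intermediate frame, shape (B, C, H, W)
    cond : initial estimate built from the two input frames
           (e.g. a blend or warped prediction), same shape as x0
    The state shifts the residual (cond - x0) onto x0 and adds noise
    scaled by sqrt(eta_t); at the last step it is roughly cond + noise.
    """
    eta_t = etas[t]
    noise = torch.randn_like(x0)
    return x0 + eta_t * (cond - x0) + kappa * torch.sqrt(eta_t) * noise


@torch.no_grad()
def uncertainty_map(reverse_sample, frame0, frame1, tau: float, n_samples: int = 8):
    """Pixel-wise uncertainty from the stochastic reverse process (sketch).

    `reverse_sample(frame0, frame1, tau)` is a placeholder for one full run of
    the learned reverse chain (on the order of 10 steps) that returns the frame
    interpolated at normalized time tau in (0, 1), i.e. the interpolation time
    the model handles explicitly.
    """
    samples = torch.stack([reverse_sample(frame0, frame1, tau) for _ in range(n_samples)])
    mean = samples.mean(dim=0)            # point estimate of the interpolated frame
    std = samples.std(dim=0).mean(dim=1)  # per-pixel spread, averaged over channels
    return mean, std
```

Because the reverse chain needs only on the order of 10 steps, drawing a handful of samples for the uncertainty map remains comparatively cheap.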
Related papers
- Video Latent Flow Matching: Optimal Polynomial Projections for Video Interpolation and Extrapolation [11.77588746719272]
This paper considers an efficient video modeling process called Video Latent Flow Matching (VLFM).
Our method relies on current strong pre-trained image generation models, modeling a certain caption-guided flow of latent patches that can be decoded to time-dependent video frames.
We conduct experiments on several text-to-video datasets to showcase the effectiveness of our method.
arXiv Detail & Related papers (2025-02-01T17:40:11Z) - Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation [0.0]
We present a conditional encoder designed to adapt an image-to-video model for large-motion frame interpolation. To enhance performance, we integrate a dual-branch feature extractor and propose a cross-frame attention mechanism (see the sketch after this list). Our approach demonstrates superior performance on the Fréchet Video Distance metric when evaluated against other state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-22T14:49:55Z) - Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation [60.27691946892796]
We present a method for generating video sequences with coherent motion between a pair of input key frames. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
arXiv Detail & Related papers (2024-08-27T17:57:14Z) - TVG: A Training-free Transition Video Generation Method with Diffusion Models [12.037716102326993]
Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives.
Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes.
We propose a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training.
arXiv Detail & Related papers (2024-08-24T00:33:14Z) - Disentangled Motion Modeling for Video Frame Interpolation [40.83962594702387]
Video Frame Interpolation (VFI) aims to synthesize intermediate frames between existing frames to enhance visual smoothness and quality. We introduce Disentangled Motion Modeling (MoMo), a diffusion-based approach for VFI that enhances visual quality by focusing on intermediate motion modeling.
arXiv Detail & Related papers (2024-06-25T03:50:20Z) - ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free video interpolation method for generative video models in a plug-and-play manner.
We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z) - CV-VAE: A Compatible Video VAE for Latent Generative Video Models [45.702473834294146]
Variational Autoencoders (VAE) play a crucial role in the spatio-temporal compression of videos in OpenAI's SORA and other video generative models.
Currently, there is no commonly used continuous video (3D) VAE for latent diffusion-based video models.
We propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE.
arXiv Detail & Related papers (2024-05-30T17:33:10Z) - Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z) - F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [94.10861578387443]
We explore the inference process of two mainstream T2V models using transformers and diffusion models.
We propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights.
Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning.
arXiv Detail & Related papers (2023-12-06T12:34:47Z) - Boost Video Frame Interpolation via Motion Adaptation [73.42573856943923]
Video frame interpolation (VFI) is a challenging task that aims to generate intermediate frames between two consecutive frames in a video.
Existing learning-based VFI methods have achieved great success, but they still suffer from limited generalization ability.
We propose a novel optimization-based VFI method that can adapt to unseen motions at test time.
arXiv Detail & Related papers (2023-06-24T10:44:02Z) - Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
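As referenced in the entry on "Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation" above, cross-frame attention lets the features of one input frame attend to the other. The sketch below is a generic, assumed formulation built on standard multi-head attention; the class name, shapes, and projections are illustrative and not taken from that paper's code.

```python
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Generic cross-frame attention block (sketch): tokens of frame A attend to frame B."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, N_tokens, dim) token features from the two input frames
        q = self.norm_q(feat_a)
        kv = self.norm_kv(feat_b)
        out, _ = self.attn(q, kv, kv)  # queries from frame A, keys/values from frame B
        return feat_a + out            # residual connection keeps frame A's content
```

Applying the block symmetrically (frame B also attending to frame A) is a common design choice when both input frames condition the intermediate prediction.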