Related papers: Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

URL: http://arxiv.org/abs/2408.15239v1
Date: Tue, 27 Aug 2024 17:57:14 GMT
Title: Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
Authors: Xiaojuan Wang, Boyang Zhou, Brian Curless, Ira Kemelmacher-Shlizerman, Aleksander Holynski, Steven M. Seitz
Abstract summary: We present a method for generating video sequences with coherent motion between a pair of input key frames. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
Score: 60.27691946892796
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
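
The dual-directional sampling idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the simple averaging rule used to fuse the two estimates, the linear noise schedule, and the stand-in `toy_model` are all assumptions made for illustration. It only shows how an estimate from a forward-in-time model (conditioned on the first key frame) and an estimate from a backward-in-time model (conditioned on the last key frame) might be aligned in time and combined at each denoising step.

```python
import numpy as np


def fuse_predictions(pred_forward, pred_backward_reversed):
    """Average the forward estimate with the time-flipped backward estimate.

    pred_backward_reversed is indexed in reversed time (frame 0 is the last
    key frame), so it is flipped along the frame axis before averaging.
    Plain averaging is an assumption, not the paper's fusion rule.
    """
    return 0.5 * (pred_forward + pred_backward_reversed[::-1])


def dual_directional_sample(first_frame, last_frame, forward_model,
                            backward_model, num_frames=8, num_steps=10, seed=0):
    """Toy deterministic sampler that fuses two directional estimates.

    forward_model and backward_model are hypothetical callables of the form
    model(noisy_video, conditioning_frame, t) -> estimate of the clean video.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames,) + first_frame.shape)
    for step in range(num_steps):
        t = 1.0 - step / num_steps             # current noise level
        t_next = 1.0 - (step + 1) / num_steps  # next noise level (0 at the end)
        est_fwd = forward_model(x, first_frame, t)        # forward-time estimate
        est_bwd = backward_model(x[::-1], last_frame, t)  # reversed-time estimate
        x0 = fuse_predictions(est_fwd, est_bwd)
        # Simplified update: shrink the residual noise toward the fused estimate.
        x = x0 + (t_next / t) * (x - x0)
    return x


def toy_model(noisy_video, cond_frame, t):
    """Stand-in 'denoiser' that predicts a static video of its conditioning frame."""
    return np.broadcast_to(cond_frame, noisy_video.shape).copy()


first = np.zeros((2, 2))
last = np.full((2, 2), 2.0)
video = dual_directional_sample(first, last, toy_model, toy_model, num_frames=4)
print(video.shape)
```

With the static stand-in models, every output frame is simply the average of the two key frames. Real image-to-video models would produce motion, and the fused estimate would then have to reconcile two different trajectories.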

Related papers

Time-adaptive Video Frame Interpolation based on Residual Diffusion [2.5261465733373965]
We propose a new diffusion-based method for video frame interpolation (VFI). We provide extensive comparisons with respect to state-of-the-art models and show that our model outperforms these models on animation videos.
arXiv Detail & Related papers (2025-04-07T18:15:45Z)
Video Latent Flow Matching: Optimal Polynomial Projections for Video Interpolation and Extrapolation [11.77588746719272]
This paper considers an efficient video modeling process called Video Latent Flow Matching (VLFM). Our method relies on current strong pre-trained image generation models, modeling a certain caption-guided flow of latent patches that can be decoded to time-dependent video frames. We conduct experiments on several text-to-video datasets to showcase the effectiveness of our method.
arXiv Detail & Related papers (2025-02-01T17:40:11Z)
Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation [0.0]
We present a conditional encoder designed to adapt an image-to-video model for large-motion frame interpolation. To enhance performance, we integrate a dual-branch feature extractor and propose a cross-frame attention mechanism. Our approach demonstrates superior performance on the Fréchet Video Distance metric when evaluated against other state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-22T14:49:55Z)
ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation. We introduce a novel bidirectional sampling strategy that addresses off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
Disentangled Motion Modeling for Video Frame Interpolation [40.83962594702387]
Video frame interpolation (VFI) aims to synthesize intermediate frames in between existing frames to enhance visual smoothness and quality. We introduce Disentangled Motion Modeling (MoMo), a diffusion-based approach for VFI that enhances visual quality by focusing on intermediate motion modeling.
arXiv Detail & Related papers (2024-06-25T03:50:20Z)
ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free video interpolation method for generative video models in a plug-and-play manner. We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules. Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
Training-Free Semantic Video Composition via Pre-trained Diffusion Model [96.0168609879295]
Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments. We propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge. Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs.
arXiv Detail & Related papers (2024-01-17T13:07:22Z)
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos. Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames. We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI). Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z)
ALANET: Adaptive Latent Attention Network for Joint Video Deblurring and Interpolation [38.52446103418748]
We introduce a novel architecture, Adaptive Latent Attention Network (ALANET), which synthesizes sharp high frame-rate videos. We employ a combination of self-attention and cross-attention modules between consecutive frames in the latent space to generate an optimized representation for each frame. Our method performs favorably against various state-of-the-art approaches, even though we tackle a much more difficult problem.
arXiv Detail & Related papers (2020-08-31T21:11:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.