Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation
- URL: http://arxiv.org/abs/2412.17042v3
- Date: Mon, 17 Feb 2025 05:13:09 GMT
- Title: Adapting Image-to-Video Diffusion Models for Large-Motion Frame Interpolation
- Authors: Luoxu Jin, Hiroshi Watanabe
- Abstract summary: We present a conditional encoder designed to adapt an image-to-video model for large-motion frame interpolation.
To enhance performance, we integrate a dual-branch feature extractor and propose a cross-frame attention mechanism.
Our approach demonstrates superior performance on the Fréchet Video Distance (FVD) metric when evaluated against other state-of-the-art approaches.
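For reference, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos (in practice the features come from a pretrained video network such as I3D). Below is a minimal sketch of the distance itself, assuming features are already extracted; this is not the paper's evaluation code:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fit to two (N, D) feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

# Example with random stand-in features:
# d = frechet_distance(np.random.randn(256, 400), np.random.randn(256, 400))
```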
- Abstract: As video generation models have advanced significantly in recent years, we adopt large-scale image-to-video diffusion models for video frame interpolation. We present a conditional encoder designed to adapt an image-to-video model for large-motion frame interpolation. To enhance performance, we integrate a dual-branch feature extractor and propose a cross-frame attention mechanism that effectively captures both spatial and temporal information, enabling accurate interpolation of intermediate frames. Our approach demonstrates superior performance on the Fréchet Video Distance (FVD) metric when evaluated against other state-of-the-art approaches, particularly in handling large-motion scenarios, highlighting the progress of generative methodologies.
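Since the abstract centers on cross-frame attention over the two conditioning frames, here is a minimal sketch of what such a block could look like. The single-head design, shapes, and names are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Tokens of one frame attend to tokens of the other frame."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, other):
        # x, other: (batch, tokens, dim) features of the two input frames
        q = self.to_q(x)
        k, v = self.to_kv(other).chunk(2, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return x + self.proj(attn @ v)  # residual keeps the frame's own content

# frame0, frame1 = torch.randn(1, 1024, 320), torch.randn(1, 1024, 320)
# fused = CrossFrameAttention(320)(frame0, frame1)
```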
Related papers
- Motion-Aware Generative Frame Interpolation
We propose Motion-aware Generative frame interpolation (MoG) to enhance the model's motion awareness by integrating explicit motion guidance (a generic sketch of one such mechanism follows this entry).
To demonstrate the versatility of our method, we train MoG on both real-world and animation datasets.
arXiv Detail & Related papers (2025-01-07T11:03:43Z)
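One common way to realize explicit motion guidance is to warp features along an estimated optical flow. The abstract above does not specify MoG's mechanism, so the following is only a generic, hypothetical sketch; the flow estimator is assumed to exist elsewhere:

```python
import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    """Backward-warp a feature map (B, C, H, W) by a flow field (B, 2, H, W) in pixels."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).to(feat)   # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow              # where each output pixel samples from
    # normalize sampling positions to [-1, 1] as required by grid_sample
    gx = 2 * coords[:, 0] / (w - 1) - 1
    gy = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)           # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)
```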
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel bidirectional sampling strategy that addresses the resulting off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames respectively, ensuring more coherent, on-manifold generation of intermediate frames (a toy sketch of this control flow follows this entry).
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
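The bidirectional strategy above can be pictured as one denoising pass conditioned on the start frame and one conditioned on the end frame over the time-reversed clip, with the two estimates fused at every step. The toy loop below shows only that control flow; `denoise_step` stands in for a real pretrained image-to-video diffusion model, and the averaging fusion is an assumption, not ViBiDSampler's actual sampler:

```python
import torch

def denoise_step(video, cond_frame, t):
    # Stand-in for one reverse-diffusion step of an image-to-video model.
    return video - 0.1 * (video - cond_frame.unsqueeze(1))

def bidirectional_sample(start, end, frames=8, steps=25):
    # start, end: (B, C, H, W) conditioning frames; returns a (B, T, C, H, W) clip
    video = torch.randn(start.shape[0], frames, *start.shape[1:])
    for t in reversed(range(steps)):
        fwd = denoise_step(video, start, t)                # forward path, start-conditioned
        bwd = denoise_step(video.flip(1), end, t).flip(1)  # backward path on the reversed clip
        video = 0.5 * (fwd + bwd)                          # naive fusion of the two estimates
    return video
```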
- Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
We present a method for generating video sequences with coherent motion between a pair of input key frames.
Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
arXiv Detail & Related papers (2024-08-27T17:57:14Z)
- Enhanced Bi-directional Motion Estimation for Video Frame Interpolation
We present a novel yet effective algorithm for motion-based video frame interpolation.
Our method achieves excellent performance on a broad range of video frame interpolation benchmarks.
arXiv Detail & Related papers (2022-06-17T06:08:43Z)
- Video Frame Interpolation with Transformer
We introduce a novel framework that takes advantage of the Transformer architecture to model long-range pixel correlations among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, in which cross-scale windows interact with each other (a simplified sketch of window attention follows this entry).
arXiv Detail & Related papers (2022-05-15T09:30:28Z)
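For intuition, here is a minimal sketch of plain window-based self-attention, the standard trick for avoiding the quadratic cost of global attention. The paper's cross-scale interaction between windows of different resolutions is omitted, and the window size and projection-free attention are simplifying assumptions:

```python
import torch

def window_attention(x, win=8):
    """Self-attention within non-overlapping win x win windows; x: (B, H, W, C)."""
    b, h, w, c = x.shape
    x = x.view(b, h // win, win, w // win, win, c)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, c)  # (B * n_windows, win*win, C)
    attn = torch.softmax(x @ x.transpose(-2, -1) * c ** -0.5, dim=-1)
    out = attn @ x
    out = out.view(b, h // win, w // win, win, win, c).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(b, h, w, c)

# feats = torch.randn(1, 64, 64, 96); out = window_attention(feats)
```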
- Video Frame Interpolation Transformer
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies via self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video frame interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- ARVo: Learning All-Range Volumetric Correspondence for Video Deblurring
Video deblurring models exploit consecutive frames to remove blurs from camera shakes and object motions.
We propose a novel implicit method to learn spatial correspondence among blurry frames in the feature space.
Our proposed method is evaluated on the widely-adopted DVD dataset, along with a newly collected High-Frame-Rate (1000 fps) dataset for Video Deblurring.
arXiv Detail & Related papers (2021-03-07T04:33:13Z)
- All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling
State-of-the-art methods are iterative solutions that interpolate one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramid-style network in the temporal domain to complete the multi-frame task in one shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences arising from its use.