I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models
- URL: http://arxiv.org/abs/2312.16693v4
- Date: Wed, 26 Jun 2024 18:00:02 GMT
- Title: I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models
- Authors: Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, Haibin Huang, Chongyang Ma
- Abstract summary: Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image.
I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism.
Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos.
- Score: 80.32562822058924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V) models by either concatenating the image with noised video frames channel-wise before being fed into the model or injecting the image embedding produced by pretrained image encoders in cross-attention modules. However, the former approach often necessitates altering the fundamental weights of pretrained T2V models, thus restricting the model's compatibility within the open-source communities and disrupting the model's prior knowledge. Meanwhile, the latter typically fails to preserve the identity of the input image. We present I2V-Adapter to overcome such limitations. I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism, maintaining the identity of the input image without any changes to the pretrained T2V model. Notably, I2V-Adapter only introduces a few trainable parameters, significantly alleviating the training cost while also ensuring compatibility with existing community-driven personalized models and control tools. Moreover, we propose a novel Frame Similarity Prior to balance the motion amplitude and the stability of generated videos through two adjustable control coefficients. Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos. This performance, coupled with its agility and adaptability, represents a substantial advancement in the field of I2V, particularly for personalized and controllable applications.
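The cross-frame attention mechanism described in the abstract can be pictured with a short sketch. The PyTorch code below is a minimal, non-authoritative illustration: the module name FirstFrameAttentionAdapter, the zero-initialized output projection, and the reuse of raw first-frame tokens as keys and values are assumptions for illustration rather than the paper's exact design.

```python
# Illustrative sketch: each (noised) frame additionally attends to tokens of the
# unnoised first frame; the result is added to the frozen self-attention output.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FirstFrameAttentionAdapter(nn.Module):
    """Hypothetical adapter attending from every frame to the unnoised first frame."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(dim, dim, bias=False)    # new, trainable query projection
        self.to_out = nn.Linear(dim, dim, bias=False)  # zero-init so training starts from the T2V prior
        nn.init.zeros_(self.to_out.weight)

    def forward(self, hidden_states: torch.Tensor, first_frame_feats: torch.Tensor) -> torch.Tensor:
        # hidden_states:     (batch * frames, tokens, dim) spatial tokens of every frame
        # first_frame_feats: (batch * frames, tokens, dim) first-frame tokens broadcast to all frames
        b, n, d = hidden_states.shape
        h = self.num_heads
        q = self.to_q(hidden_states).view(b, n, h, d // h).transpose(1, 2)
        k = first_frame_feats.view(b, n, h, d // h).transpose(1, 2)
        v = k  # reuse first-frame tokens as keys and values (illustrative simplification)
        out = F.scaled_dot_product_attention(q, k, v)   # (b, h, n, d // h)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)  # added residually to the frozen self-attention output
```

In practice the adapter's output would be added residually to the output of a frozen self-attention block of the pretrained T2V model, so only the small adapter is trained while the base weights stay untouched, consistent with the abstract's claim of introducing few trainable parameters.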
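The Frame Similarity Prior with its two adjustable control coefficients can likewise be sketched. The function below is only a plausible reading, assuming the prior initializes every frame from a noised, optionally degraded copy of the first-frame latent, with degradation strength and noise level acting as the two coefficients; the paper's exact formulation may differ.

```python
# Illustrative sketch of a Frame-Similarity-Prior-style initialization (assumption,
# not the paper's exact formulation).
import torch


def frame_similarity_init(first_frame_latent: torch.Tensor,
                          num_frames: int,
                          noise_level: float = 0.95,
                          degrade=lambda z: z) -> torch.Tensor:
    """Start each frame from a partially noised copy of the (optionally degraded) first frame."""
    base = degrade(first_frame_latent)                        # coefficient 1: degradation strength
    noise = torch.randn(num_frames, *first_frame_latent.shape)
    keep = (1.0 - noise_level) ** 0.5                         # share of first-frame signal retained
    return keep * base + (noise_level ** 0.5) * noise         # coefficient 2: noise level
```

Under this reading, raising noise_level permits larger motion amplitude, while lowering it biases sampling toward frames that stay close to the input image, which is how the two coefficients would trade motion against stability.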
Related papers
- PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation [36.21554597804604]
Identity-specific human video generation with customized ID images is still under-explored.
We propose a novel framework, dubbed PersonalVideo, that applies direct supervision on videos synthesized by the T2V model.
Our method delivers high identity faithfulness while preserving the inherent video generation quality of the original T2V model, outshining prior approaches.
arXiv Detail & Related papers (2024-11-26T02:25:38Z) - FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis.
We present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them.
We propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which respectively improve the efficiency of fine-tuning diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models.
arXiv Detail & Related papers (2024-10-20T12:10:24Z) - FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model [76.84519526283083]
We present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios.
FiTv2 exhibits $2\times$ the convergence speed of FiT when incorporating advanced training-free extrapolation techniques.
Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions.
arXiv Detail & Related papers (2024-10-17T15:51:49Z) - AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z) - ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video [15.952896909797728]
Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks.
Recent research is shifting its focus toward parameter-efficient image-to-video adaptation.
We present a new adaptation paradigm (ZeroI2V) to transfer the image transformers to video recognition tasks.
arXiv Detail & Related papers (2023-10-02T16:41:20Z) - Make It Move: Controllable Image-to-Video Generation with Text Descriptions [69.52360725356601]
The TI2V task aims at generating videos from a static image and a text description.
To address these challenges, we propose a Motion Anchor-based video GEnerator (MAGE) with an innovative motion anchor structure.
Experiments conducted on the datasets verify the effectiveness of MAGE and show the appealing potential of the TI2V task.
arXiv Detail & Related papers (2021-12-06T07:00:36Z)