SimDA: Simple Diffusion Adapter for Efficient Video Generation
- URL: http://arxiv.org/abs/2308.09710v1
- Date: Fri, 18 Aug 2023 17:58:44 GMT
- Title: SimDA: Simple Diffusion Adapter for Efficient Video Generation
- Authors: Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning.
- Score: 102.90154301044095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent wave of AI-generated content has seen remarkable progress and success in Text-to-Image (T2I) technologies. By contrast, Text-to-Video (T2V) still falls short of expectations despite attracting increasing interest. Existing works either train from scratch or adapt a large T2I model to videos, both of which are computationally expensive and resource intensive. In this work, we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way. In particular, we turn the T2I model into a T2V model by designing lightweight spatial and temporal adapters for transfer learning. In addition, we replace the original spatial attention with the proposed Latent-Shift Attention (LSA) for temporal consistency. With a similar model architecture, we further train a video super-resolution model to generate high-definition (1024x1024) videos. Beyond T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning. In doing so, our method minimizes the training effort, requiring extremely few tunable parameters for model adaptation.
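The abstract describes two mechanisms: lightweight adapters inserted into a frozen T2I backbone, and a Latent-Shift Attention that shifts part of the latent channels across neighboring frames so that per-frame spatial attention gains temporal context. The sketch below is a hypothetical illustration of these two ideas in PyTorch, not the authors' implementation; the class and function names, the bottleneck width, and the shift ratio are assumptions for illustration only.

```python
# Hypothetical sketch of SimDA-style components (not the authors' released code).
# Assumed names/values: TemporalAdapter, latent_shift, bottleneck=64, shift_ratio=0.25.
import torch
import torch.nn as nn


class TemporalAdapter(nn.Module):
    """Bottleneck adapter added after a frozen T2I block; only these weights train."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Zero-init the up-projection so the adapter starts as an identity map
        # and the pretrained T2I behavior is preserved at the start of tuning.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, tokens, dim) features from the frozen backbone.
        return x + self.up(self.act(self.down(x)))


def latent_shift(x: torch.Tensor, frames: int, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels to adjacent frames before spatial attention,
    so each frame's attention also sees features from its temporal neighbors."""
    bf, tokens, dim = x.shape
    b = bf // frames
    x = x.reshape(b, frames, tokens, dim)
    c = int(dim * shift_ratio)
    out = x.clone()
    out[:, 1:, :, :c] = x[:, :-1, :, :c]            # pull features from the previous frame
    out[:, :-1, :, c:2 * c] = x[:, 1:, :, c:2 * c]  # pull features from the next frame
    return out.reshape(bf, tokens, dim)
```

Under this kind of setup, the parameter efficiency claimed in the abstract would come from freezing the T2I backbone and optimizing only the adapter weights, e.g. `torch.optim.AdamW(adapter.parameters(), lr=1e-4)`, while the channel shift itself adds no trainable parameters.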
Related papers
- FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention owing to its wide application in video synthesis.
We present FrameBridge, which takes the given static image as the prior of the video target and establishes a tractable bridge model between them.
We propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which improve the efficiency of fine-tuning diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models, respectively.
arXiv Detail & Related papers (2024-10-20T12:10:24Z) - Still-Moving: Customized Video Generation without Customized Video Data [81.09302547183155]
We introduce Still-Moving, a novel framework for customizing a text-to-video (T2V) model.
The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model.
We train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers.
arXiv Detail & Related papers (2024-07-11T17:06:53Z) - AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - VideoElevator: Elevating Video Generation Quality with Versatile
Text-to-Image Diffusion Models [94.25084162939488]
Text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment.
We introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using the superior capabilities of T2I.
arXiv Detail & Related papers (2024-03-08T16:44:54Z) - Tune-A-Video: One-Shot Tuning of Image Diffusion Models for
Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive datasets for T2V generation.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.