AtomoVideo: High Fidelity Image-to-Video Generation
- URL: http://arxiv.org/abs/2403.01800v2
- Date: Tue, 5 Mar 2024 08:19:51 GMT
- Title: AtomoVideo: High Fidelity Image-to-Video Generation
- Authors: Litong Gong, Yiran Zhu, Weijie Li, Xiaoyang Kang, Biao Wang, Tiezheng Ge, Bo Zheng
- Abstract summary: We propose a high fidelity framework for image-to-video generation, named AtomoVideo.
Based on multi-granularity image injection, we achieve higher fidelity of the generated video to the given image.
Our architecture extends flexibly to the video frame prediction task, enabling long sequence prediction through iterative generation.
- Score: 25.01443995920118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, video generation has developed rapidly, building on
superior text-to-image generation techniques. In this work, we propose a
high-fidelity framework for image-to-video generation, named AtomoVideo. Based
on multi-granularity image injection, we achieve higher fidelity of the
generated video to the given image. In addition, thanks to high-quality
datasets and training strategies, we achieve greater motion intensity while
maintaining superior temporal consistency and stability. Our architecture
extends flexibly to the video frame prediction task, enabling long-sequence
prediction through iterative generation. Furthermore, owing to the adapter
training design, our approach combines well with existing personalized models
and controllable modules. In both quantitative and qualitative evaluation,
AtomoVideo achieves superior results compared to popular methods; more
examples can be found on our project website:
https://atomo-video.github.io/.
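The abstract does not spell out the iterative generation used for long-sequence prediction; the sketch below shows one common way such a scheme is implemented, conditioning each new chunk on the tail frames of the previous one. The `generate_chunk` interface and all parameters are assumptions for illustration, not AtomoVideo's actual API.

```python
import torch

def generate_long_video(generate_chunk, first_frame, chunk_len=16,
                        overlap=4, num_chunks=4):
    """Iteratively extend a video: each chunk is conditioned on the tail
    frames of the previous one. `generate_chunk(cond, n)` is a hypothetical
    denoiser interface returning an (n, C, H, W) clip whose first
    `cond.shape[0]` frames reproduce the conditioning frames."""
    video = generate_chunk(first_frame.unsqueeze(0), chunk_len)
    for _ in range(num_chunks - 1):
        cond = video[-overlap:]                   # condition on tail frames
        chunk = generate_chunk(cond, chunk_len)   # predict the next clip
        video = torch.cat([video, chunk[overlap:]], dim=0)  # keep new frames
    return video
```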
Related papers
- MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance [11.267119929093042]
We propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length.
Confidence-aware pose guidance ensures high frame quality and temporal smoothness.
For generating long and smooth videos, we propose a progressive latent fusion strategy.
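The progressive latent fusion strategy is not detailed in the abstract; a common way to stitch long videos from overlapping denoised segments is to cross-fade their latents in the overlap region. A minimal sketch under that assumption, not MimicMotion's exact procedure:

```python
import torch

def fuse_overlapping_latents(segments, overlap):
    """Blend a list of latent segments of shape (T, C, H, W) that overlap
    by `overlap` frames, using linear cross-fade weights in the overlap.
    A generic illustration, not MimicMotion's actual algorithm."""
    fused = segments[0]
    w = torch.linspace(0, 1, overlap).view(-1, 1, 1, 1)  # fade-in weights
    for seg in segments[1:]:
        blended = (1 - w) * fused[-overlap:] + w * seg[:overlap]
        fused = torch.cat([fused[:-overlap], blended, seg[overlap:]], dim=0)
    return fused
```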
arXiv Detail & Related papers (2024-06-28T06:40:53Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net, a workhorse behind image generation, scales poorly when generating videos, incurring significant computational overhead.
Our scaled spatiotemporal transformer architecture instead allows us to efficiently train a text-to-video model with billions of parameters for the first time, to reach state-of-the-art results on a number of benchmarks, and to generate videos with substantially higher quality, temporal consistency, and motion complexity.
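Snap Video's concrete architecture is not given in this summary; the block below illustrates the general transformer-based, video-first pattern the title points to, factorizing attention over space and time. It is a generic sketch, not Snap Video's actual design.

```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Attention over spatial tokens within each frame, then over time at
    each spatial location. A generic pattern for transformer-based video
    models; not Snap Video's specific architecture."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                    # x: (B, T, N, D) frames x tokens
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)           # treat frames as separate batches
        h = self.norm1(s)
        s = s + self.spatial(h, h, h)[0]     # spatial attention per frame
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm2(t)
        t = t + self.temporal(h, h, h)[0]    # temporal attention per location
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)
```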
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose a high-fidelity image-to-video generation method by devising a frame retention branch based on a pre-trained video diffusion model, named DreamVideo.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
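No implementation detail of the frame retention branch is given here; one common way to retain a reference image in a pre-trained video diffusion model is to concatenate its encoded latent to every frame's input channels. A sketch under that assumption, not necessarily DreamVideo's design:

```python
import torch

def build_retention_input(noisy_latents, image_latent):
    """Concatenate a reference image latent to each frame's noisy latent
    along the channel axis, a common image-conditioning scheme.
    noisy_latents: (B, T, C, H, W); image_latent: (B, C, H, W)."""
    T = noisy_latents.shape[1]
    ref = image_latent.unsqueeze(1).expand(-1, T, -1, -1, -1)  # repeat in time
    return torch.cat([noisy_latents, ref], dim=2)  # (B, T, 2C, H, W)
```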
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach that can produce high-definition videos with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
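Only the first stage of this pipeline can be written against a real API; `reference_guided_video_model` below is a hypothetical stand-in that shows the data flow from the text-to-image reference into the video stage.

```python
import torch
from diffusers import StableDiffusionPipeline

# Stage 1: an off-the-shelf text-to-image model produces the reference frame
# (assumes a CUDA device is available).
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
prompt = "a red panda drinking tea, cinematic lighting"
reference = t2i(prompt).images[0]

# Stage 2 (hypothetical interface): a latent video diffusion model conditioned
# on the prompt and the reference image generates the final video.
# video = reference_guided_video_model(prompt=prompt, reference=reference)
```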
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video capable not only of generating videos of high fidelity, but also of a high degree of controllability and world knowledge.
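As a schematic of what a cascade means here: a base model samples a short low-resolution clip, and successive spatial and temporal super-resolution diffusion models refine it. All model interfaces below are hypothetical, not Imagen Video's API.

```python
def run_cascade(base_model, sr_stages, prompt):
    """Cascaded sampling sketch with hypothetical model interfaces: each
    stage is a diffusion model that upsamples the previous stage's output
    in space (resolution) or time (frame rate)."""
    video = base_model.sample(prompt)  # short, low-resolution base clip
    for stage in sr_stages:            # alternating spatial/temporal SR models
        video = stage.sample(prompt, conditioning_video=video)
    return video
```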
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
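The two ingredients implied above, a high-fidelity image tokenizer/generator and an autoregressive prior over latents, compose as in the hedged sketch below; the encoder, prior, and decoder interfaces are assumptions, not HARP's actual API.

```python
import torch

@torch.no_grad()
def predict_future_frames(encoder, prior, decoder, context_frames, n_future):
    """Autoregressive latent video prediction sketch (hypothetical
    interfaces): encode context frames into discrete latent tokens, let an
    autoregressive prior extend the sequence frame by frame, then decode
    the new tokens with a high-fidelity image generator."""
    tokens = [encoder(f) for f in context_frames]    # per-frame token grids
    for _ in range(n_future):
        tokens.append(prior.sample(tokens))          # next frame's tokens
    return [decoder(t) for t in tokens[-n_future:]]  # decode predictions
```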
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
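The training objective is not stated in this summary; models in this family are typically trained with the standard denoising loss, shown here as general background rather than a detail taken from the paper:

```latex
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
\qquad
\mathcal{L} = \mathbb{E}_{x_0,\;\epsilon\sim\mathcal{N}(0,I),\;t}
\big[\,\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^2\,\big]
```

where x_0 is a clean video clip, t a diffusion timestep, and epsilon_theta the denoising network.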
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.