The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
- URL: http://arxiv.org/abs/2410.05227v2
- Date: Thu, 10 Oct 2024 14:17:30 GMT
- Title: The Dawn of Video Generation: Preliminary Explorations with SORA-like Models
- Authors: Ailing Zeng, Yuhang Yang, Weidong Chen, Wei Liu
- Abstract summary: High-quality video generation, encompassing text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation, holds considerable significance in content creation.
Models like SORA have advanced video generation, producing videos with higher resolution, more natural motion, better vision-language alignment, and increased controllability.
- Score: 14.528428430884015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality video generation, encompassing text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation, holds considerable significance in content creation, where it lets anyone express their inherent creativity in new ways, and in world simulation, where it supports modeling and understanding the world. Models like SORA have advanced video generation, producing videos with higher resolution, more natural motion, better vision-language alignment, and increased controllability, particularly for long video sequences. These improvements have been driven by the evolution of model architectures, shifting from UNet to more scalable and parameter-rich DiT models, along with large-scale data expansion and refined training strategies. However, despite the emergence of DiT-based closed-source and open-source models, a comprehensive investigation into their capabilities and limitations remains lacking. Furthermore, the rapid pace of development has made it challenging for recent benchmarks to fully cover SORA-like models and recognize their significant advancements. Additionally, evaluation metrics often fail to align with human preferences.
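The abstract attributes much of this progress to the architectural shift from UNet to DiT backbones. As a point of reference, below is a minimal sketch of a DiT-style transformer block with adaptive layer-norm conditioning; the dimensions and module layout are illustrative assumptions, not the architecture of SORA or any specific model discussed here.

```python
# Minimal sketch of a DiT-style block: self-attention + MLP over patchified
# latent tokens, modulated by adaptive LayerNorm (adaLN) on a conditioning
# vector (e.g., timestep + text embedding). Dimensions are illustrative only.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN: conditioning vector -> per-block scale/shift/gate parameters
        self.ada = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) patchified video latents; cond: (batch, dim)
        s1, b1, g1, s2, b2, g2 = self.ada(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

x = torch.randn(2, 256, 1024)      # 2 videos, 256 latent tokens each
cond = torch.randn(2, 1024)        # timestep/text conditioning vector
print(DiTBlock()(x, cond).shape)   # torch.Size([2, 256, 1024])
```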
Related papers
- Enhance-A-Video: Better Generated Video for Free [57.620595159855064]
We introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos.
Our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning.
arXiv Detail & Related papers (2025-02-11T12:22:35Z) - ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
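As a rough illustration of the adaptive norm-based semantic injection described here, the sketch below embeds coarse AR-generated token ids and turns them into scale/shift terms applied to normalized DiT features; the module names and sizes are assumptions, not ARLON's actual implementation.

```python
# Hedged sketch of adaLN-style semantic injection: coarse discrete tokens from
# an AR model are embedded and converted into scale/shift terms that modulate
# normalized DiT features. Names and sizes are illustrative, not ARLON's code.
import torch
import torch.nn as nn

class SemanticInjection(nn.Module):
    def __init__(self, dit_dim: int = 1024, codebook_size: int = 8192, tok_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, tok_dim)    # VQ-VAE code embeddings
        self.to_scale_shift = nn.Linear(tok_dim, 2 * dit_dim)   # coarse tokens -> (scale, shift)
        self.norm = nn.LayerNorm(dit_dim, elementwise_affine=False)

    def forward(self, dit_hidden: torch.Tensor, coarse_ids: torch.Tensor) -> torch.Tensor:
        # dit_hidden: (B, N, dit_dim) DiT features; coarse_ids: (B, M) AR-generated token ids
        ctx = self.codebook(coarse_ids).mean(dim=1)              # pool coarse semantic context
        scale, shift = self.to_scale_shift(ctx).unsqueeze(1).chunk(2, dim=-1)
        return self.norm(dit_hidden) * (1 + scale) + shift       # inject coarse semantics

inject = SemanticInjection()
out = inject(torch.randn(2, 256, 1024), torch.randint(0, 8192, (2, 64)))
print(out.shape)  # torch.Size([2, 256, 1024])
```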
arXiv Detail & Related papers (2024-10-27T16:28:28Z) - ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning [36.378348127629195]
We propose a novel post-tuning methodology for video synthesis models, called ExVideo.
This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations.
Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos.
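One common parameter-efficient post-tuning recipe consistent with this summary is to freeze the pretrained backbone and update only its temporal modules; the sketch below shows that pattern, with the "temporal" naming convention assumed for illustration rather than taken from ExVideo's code.

```python
# Minimal sketch of parameter-efficient post-tuning for longer clips: freeze a
# pretrained video diffusion backbone and update only its temporal modules.
import torch

def trainable_temporal_params(model: torch.nn.Module, keyword: str = "temporal"):
    """Freeze everything, then re-enable only parameters whose name contains `keyword`."""
    for name, p in model.named_parameters():
        p.requires_grad = keyword in name
    return [p for p in model.parameters() if p.requires_grad]

# Usage (assuming `video_unet` is a pretrained video diffusion backbone):
# params = trainable_temporal_params(video_unet)
# optimizer = torch.optim.AdamW(params, lr=1e-5)
```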
arXiv Detail & Related papers (2024-06-20T09:18:54Z) - iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making.
This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens.
iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
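A minimal sketch of how visual observation tokens, actions, and rewards could be flattened into one sequence for an autoregressive transformer is given below; the vocabulary sizes and offset scheme are assumptions for illustration, not iVideoGPT's actual tokenizer.

```python
# Hedged sketch: interleave per-frame observation codes, actions, and rewards
# into one flat token sequence an AR transformer can predict left-to-right.
import torch

OBS_VOCAB, ACT_VOCAB, REW_VOCAB = 8192, 32, 8   # hypothetical vocabulary sizes

def interleave_tokens(obs_tokens, actions, rewards):
    """obs_tokens: (T, K) discrete codes per frame; actions, rewards: (T,) ids."""
    chunks = []
    for t in range(obs_tokens.shape[0]):
        chunks.append(obs_tokens[t])                              # frame codes
        chunks.append(actions[t:t + 1] + OBS_VOCAB)               # shift into action range
        chunks.append(rewards[t:t + 1] + OBS_VOCAB + ACT_VOCAB)   # shift into reward range
    return torch.cat(chunks)

seq = interleave_tokens(torch.randint(0, OBS_VOCAB, (4, 16)),
                        torch.randint(0, ACT_VOCAB, (4,)),
                        torch.randint(0, REW_VOCAB, (4,)))
print(seq.shape)  # torch.Size([72]) == 4 frames * (16 + 1 + 1) tokens
```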
arXiv Detail & Related papers (2024-05-24T05:29:12Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder to maintain consistency within short sequences; globally, a flow-guided recurrent latent propagation module enhances stability across longer sequences without additional training.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
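The sketch below shows one way a temporal layer of this kind can be inserted into a pretrained spatial network: features are reshaped so a zero-initialized 1D convolution mixes information across frames at each spatial location, with a residual connection preserving the pretrained behavior at initialization. This is an illustrative pattern, not Upscale-A-Video's actual layer.

```python
# Hedged sketch of a temporal mixing layer added to a pretrained U-Net/VAE
# decoder: fold space into the batch dimension and convolve across frames.
import torch
import torch.nn as nn

class TemporalMixing(nn.Module):
    def __init__(self, channels: int = 320):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.conv.weight)      # start as identity (residual branch is zero)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) video features
        b, c, t, h, w = x.shape
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)   # fold space into batch
        y = self.conv(y).reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        return x + y                                             # residual temporal mixing

x = torch.randn(1, 320, 8, 32, 32)
print(TemporalMixing()(x).shape)  # torch.Size([1, 320, 8, 32, 32])
```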
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z) - LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion
Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
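As a hedged illustration of a cascaded pipeline in this spirit, the sketch below chains stages so each one refines the previous stage's latents given the prompt; the stage interfaces and placeholder stages are assumptions for illustration, not LaVie's actual components.

```python
# Hedged sketch of a cascaded latent-diffusion pipeline: each stage consumes
# the prompt embedding and the previous stage's latents and refines them.
from typing import Callable, List
import torch

def run_cascade(prompt_embedding: torch.Tensor,
                stages: List[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]],
                init_latents: torch.Tensor) -> torch.Tensor:
    """Feed each stage the prompt and the previous stage's output latents."""
    latents = init_latents
    for stage in stages:
        latents = stage(latents, prompt_embedding)
    return latents

# Usage with placeholder stages (real stages would be diffusion samplers):
upsample_frames = lambda z, p: z.repeat_interleave(2, dim=2)       # 2x more frames
upsample_space = lambda z, p: torch.nn.functional.interpolate(     # 2x spatial size
    z.flatten(0, 1), scale_factor=2).unflatten(0, z.shape[:2])
video = run_cascade(torch.randn(1, 77, 768),
                    [upsample_frames, upsample_space],
                    torch.randn(1, 4, 16, 32, 32))   # (B, C, T, H, W) latents
print(video.shape)  # torch.Size([1, 4, 32, 64, 64])
```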
arXiv Detail & Related papers (2023-09-26T17:52:03Z) - Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
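The sketch below illustrates the general idea of projecting a video feature volume into compact image-like latents, using simple mean pooling along each axis as a stand-in for the paper's learned projection; it is not PVDM's actual encoder.

```python
# Hedged sketch of "projected" video latents: collapse a 3D feature volume
# along each axis so a diffusion model can operate on 2D, image-like maps.
import torch

def project_video_latents(feats: torch.Tensor):
    """feats: (B, C, T, H, W) -> three 2D latent maps, one per collapsed axis."""
    z_hw = feats.mean(dim=2)   # (B, C, H, W): spatial content, time collapsed
    z_tw = feats.mean(dim=3)   # (B, C, T, W): motion along width
    z_th = feats.mean(dim=4)   # (B, C, T, H): motion along height
    return z_hw, z_tw, z_th

planes = project_video_latents(torch.randn(1, 8, 16, 32, 32))
print([p.shape for p in planes])
# [torch.Size([1, 8, 32, 32]), torch.Size([1, 8, 16, 32]), torch.Size([1, 8, 16, 32])]
```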
arXiv Detail & Related papers (2023-02-15T14:22:34Z)