A Survey: Spatiotemporal Consistency in Video Generation
- URL: http://arxiv.org/abs/2502.17863v1
- Date: Tue, 25 Feb 2025 05:20:51 GMT
- Title: A Survey: Spatiotemporal Consistency in Video Generation
- Authors: Zhiyu Yin, Kehai Chen, Xuefeng Bai, Ruili Jiang, Juntao Li, Hongdong Li, Jin Liu, Yang Xiang, Jun Yu, Min Zhang
- Abstract summary: Video generation, by leveraging dynamic visual generation methods, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Recent works have aimed at addressing the spatiotemporal consistency issue in video generation, yet few literature reviews have been organized from this perspective. We systematically review recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics.
- Score: 72.82267240482874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video generation, by leveraging a dynamic visual generation method, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Video generation presents unique challenges beyond static image generation, requiring both high-quality individual frames and temporal coherence to maintain consistency across the spatiotemporal sequence. Recent works have aimed at addressing the spatiotemporal consistency issue in video generation, yet few literature reviews have been organized from this perspective. This gap hinders a deeper understanding of the underlying mechanisms for high-quality video generation. In this survey, we systematically review the recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics. We particularly focus on their contributions to maintaining spatiotemporal consistency. Finally, we discuss future directions and challenges in this field, hoping to inspire further efforts to advance the development of video generation.
Related papers
- RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos.
We advocate for the incorporation of a retrieval mechanism during the generation phase.
Our pipeline is designed to apply to any text-to-video diffusion model.
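To make the retrieval idea concrete, here is a minimal sketch of fetching motion context for a prompt embedding and appending it to the conditioning that a diffusion model cross-attends to. The function name `retrieve_motion_context` and the precomputed embedding bank are illustrative assumptions, not RAGME's actual interface.

```python
import numpy as np

def retrieve_motion_context(prompt_emb: np.ndarray,
                            bank: np.ndarray,
                            k: int = 3) -> np.ndarray:
    """Return the k motion-clip embeddings most similar to the prompt."""
    # Cosine similarity between the prompt and every clip in the bank.
    sims = bank @ prompt_emb / (
        np.linalg.norm(bank, axis=1) * np.linalg.norm(prompt_emb) + 1e-8)
    top_k = np.argsort(sims)[::-1][:k]
    return bank[top_k]

# Toy usage: 512-d embeddings, a bank of 1000 clips (all hypothetical).
rng = np.random.default_rng(0)
bank = rng.normal(size=(1000, 512)).astype(np.float32)
prompt_emb = rng.normal(size=512).astype(np.float32)

context = retrieve_motion_context(prompt_emb, bank)
# The retrieved embeddings would be appended to the text tokens that
# the diffusion model cross-attends to during sampling.
conditioning = np.concatenate([prompt_emb[None, :], context], axis=0)
print(conditioning.shape)  # (4, 512)
```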
arXiv Detail & Related papers (2025-04-09T08:14:05Z)
- Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance [2.5941932242768457]
Mask-guided video generation controls the generated video through mask motion sequences.
Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control.
This approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality.
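One simple way such mask guidance can be realized is latent compositing at each denoising step: the model's prediction is kept inside the foreground mask while background latents are frozen. This is a hedged sketch of that general technique; the paper's actual mechanism may differ.

```python
import numpy as np

def masked_denoise_step(latents: np.ndarray,
                        denoised: np.ndarray,
                        fg_mask: np.ndarray) -> np.ndarray:
    """Apply the model's prediction only inside the foreground mask."""
    # fg_mask is 1 where the subject may move, 0 elsewhere; it is
    # broadcast over the channel dimension.
    return fg_mask * denoised + (1.0 - fg_mask) * latents

# Toy shapes: (frames, channels, height, width).
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 4, 32, 32))
denoised = rng.normal(size=(8, 4, 32, 32))
fg_mask = np.zeros((8, 1, 32, 32))
fg_mask[:, :, 8:24, 8:24] = 1.0  # hypothetical moving-subject region

out = masked_denoise_step(latents, denoised, fg_mask)
print(out.shape)  # (8, 4, 32, 32)
```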
arXiv Detail & Related papers (2025-03-24T06:53:08Z)
- Enhance-A-Video: Better Generated Video for Free [57.620595159855064]
We introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos.
Our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning.
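The abstract does not spell out the mechanism, but a training-free enhancement in this spirit could amplify the cross-frame (off-diagonal) part of a temporal attention map and renormalize, as in the hypothetical sketch below; the scaling rule shown is an assumption, not the paper's exact formulation.

```python
import numpy as np

def enhance_temporal_attention(attn: np.ndarray,
                               scale: float = 1.2) -> np.ndarray:
    """attn: (frames, frames) row-stochastic temporal attention map."""
    f = attn.shape[0]
    eye = np.eye(f)
    # Boost cross-frame terms; leave each frame's self-attention as-is.
    boosted = attn * (eye + scale * (1.0 - eye))
    # Re-normalize so every row remains a probability distribution.
    return boosted / boosted.sum(axis=1, keepdims=True)

# Toy temporal attention over 6 frames.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 6))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(enhance_temporal_attention(attn).sum(axis=1))  # rows still sum to 1
```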
arXiv Detail & Related papers (2025-02-11T12:22:35Z)
- Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation [2.4240014793575138]
As of this writing, OpenAI's Sora, the current state-of-the-art system, is still limited to producing videos that are up to one minute in length.
In this survey, we examine the current landscape of long video generation, covering techniques like GANs and diffusion models, video generation strategies, large-scale training datasets, quality metrics for evaluating long videos, and future research areas to address the limitations of existing video generation capabilities.
arXiv Detail & Related papers (2024-12-24T21:24:41Z)
- DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation [50.66658181705527]
We present DAWN, a framework that enables all-at-once generation of dynamic-length video sequences.
DAWN consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation.
Our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements.
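To picture the "all-at-once" property, the schematic sketch below maps a whole audio feature sequence to a whole latent motion sequence in a single pass, so no frame depends on a previously generated one. `MotionGenerator` and its linear mapping are illustrative stand-ins, not DAWN's architecture.

```python
import numpy as np

class MotionGenerator:
    """Illustrative stand-in: one forward pass maps a whole audio
    sequence (T, d_audio) to a whole latent motion sequence (T, latent_dim)."""

    def __init__(self, d_audio: int = 128, latent_dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(size=(d_audio, latent_dim)) * 0.02

    def __call__(self, audio_feats: np.ndarray) -> np.ndarray:
        # Non-autoregressive: no frame conditions on a previously
        # generated frame, so the length T can vary freely at test time.
        return audio_feats @ self.proj

gen = MotionGenerator()
for T in (40, 125):  # dynamic-length generation with the same model
    motion = gen(np.ones((T, 128)))
    print(motion.shape)  # (40, 64) then (125, 64)
```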
arXiv Detail & Related papers (2024-10-17T16:32:36Z)
- A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights [8.192172339127657]
Human video generation aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose.
Recent advancements in generative models have laid a solid foundation for the growing interest in this area.
Despite significant progress, human video generation remains challenging due to the difficulty of maintaining character consistency, the complexity of human motion, and the intricacy of human-environment interactions.
arXiv Detail & Related papers (2024-07-11T12:09:05Z)
- Video-to-Audio Generation with Hidden Alignment [27.11625918406991]
We offer insights into the video-to-audio generation paradigm, focusing on vision encoders, auxiliary embeddings, and data augmentation techniques.
We demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities.
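As a rough sketch of the conditioning path implied above, pooled vision-encoder features can be fused with auxiliary embeddings before conditioning an audio generator. `build_condition` and fusion-by-concatenation are illustrative assumptions, not the paper's API.

```python
import numpy as np

def build_condition(frame_feats: np.ndarray, aux_emb: np.ndarray) -> np.ndarray:
    """frame_feats: (T, d) per-frame vision features; aux_emb: (d_aux,)."""
    pooled = frame_feats.mean(axis=0)          # temporal average pooling
    return np.concatenate([pooled, aux_emb])   # fused conditioning vector

rng = np.random.default_rng(0)
cond = build_condition(rng.normal(size=(16, 512)), rng.normal(size=64))
print(cond.shape)  # (576,) -> would feed the audio decoder's cross-attention
```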
arXiv Detail & Related papers (2024-07-10T08:40:39Z)
- Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation [30.245348014602577]
We discuss the evolution of video generation from text, from animating MNIST digits to simulating the physical world with Sora.
Our review of the shortcomings of Sora-generated videos points to the need for more in-depth studies in various enabling aspects of video generation.
We conclude that the study of text-to-video generation may still be in its infancy, requiring contributions from the cross-disciplinary research community.
arXiv Detail & Related papers (2024-03-08T07:58:13Z)
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
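The "swap" idea can be pictured as spatial and temporal features taking turns as attention queries, so the two perceptions interact in both directions. The block below is a hedged reading of that stated idea, not the paper's exact module design.

```python
import numpy as np

def attend(q: np.ndarray, kv: np.ndarray) -> np.ndarray:
    """Plain scaled dot-product attention; q: (N, d), kv: (M, d)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def swap_attention(spatial: np.ndarray, temporal: np.ndarray):
    # Pass 1: spatial queries attend over temporal features.
    spatial_out = attend(spatial, temporal)
    # Pass 2: roles swapped; temporal queries attend over spatial output.
    temporal_out = attend(temporal, spatial_out)
    return spatial_out, temporal_out

rng = np.random.default_rng(0)
s, t = swap_attention(rng.normal(size=(256, 64)), rng.normal(size=(16, 64)))
print(s.shape, t.shape)  # (256, 64) (16, 64)
```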
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text-to-video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
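A hedged illustration of latent path construction: interpolate a smooth path between endpoint latents and decode one frame per point, so temporal coherence follows from the continuity of the path. `decode_frame` is a hypothetical placeholder for a real generator's decoder.

```python
import numpy as np

def latent_path(z_start: np.ndarray, z_end: np.ndarray,
                num_frames: int) -> np.ndarray:
    """Linearly interpolate a path of `num_frames` latents."""
    ts = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - ts) * z_start + ts * z_end

def decode_frame(z: np.ndarray) -> np.ndarray:
    # Placeholder decoder: a real model would map z to an RGB frame.
    return np.tanh(z)

rng = np.random.default_rng(0)
path = latent_path(rng.normal(size=128), rng.normal(size=128), 16)
frames = [decode_frame(z) for z in path]
print(len(frames), frames[0].shape)  # 16 (128,)
```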
arXiv Detail & Related papers (2021-07-29T06:28:20Z)