Related papers: Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation

URL: http://arxiv.org/abs/2403.05131v2
Date: Fri, 7 Jun 2024 07:40:07 GMT
Title: Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation
Authors: Joseph Cho, Fachrina Dewi Puspitasari, Sheng Zheng, Jingyao Zheng, Lik-Hang Lee, Tae-Ho Kim, Choong Seon Hong, Chaoning Zhang
Abstract summary: We discuss the evolution of video generation from text, starting with animating MNIST numbers to simulating the physical world with Sora. Our review into the shortcomings of Sora-generated videos pinpoints the call for more in-depth studies in various enabling aspects of video generation. We conclude that the study of the text-to-video generation may still be in its infancy, requiring contribution from the cross-discipline research community.
Score: 30.245348014602577
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The evolution of video generation from text, starting with animating MNIST numbers to simulating the physical world with Sora, has progressed at a breakneck speed over the past seven years. While often seen as a superficial expansion of the predecessor text-to-image generation model, text-to-video generation models are developed upon carefully engineered constituents. Here, we systematically discuss these elements consisting of but not limited to core building blocks (vision, language, and temporal) and supporting features from the perspective of their contributions to achieving a world model. We employ the PRISMA framework to curate 97 impactful research articles from renowned scientific databases primarily studying video synthesis using text conditions. Upon minute exploration of these manuscripts, we observe that text-to-video generation involves more intricate technologies beyond the plain extension of text-to-image generation. Our additional review into the shortcomings of Sora-generated videos pinpoints the call for more in-depth studies in various enabling aspects of video generation such as dataset, evaluation metric, efficient architecture, and human-controlled generation. Finally, we conclude that the study of the text-to-video generation may still be in its infancy, requiring contribution from the cross-discipline research community towards its advancement as the first step to realize artificial general intelligence (AGI).

Related papers

ASurvey: Spatiotemporal Consistency in Video Generation [72.82267240482874]
Video generation, by leveraging a dynamic visual generation method, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Recent works have aimed at addressing the temporal consistency issue in video generation, while few literature reviews have been organized from this perspective. We systematically review recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics.
arXiv Detail & Related papers (2025-02-25T05:20:51Z)
A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights [8.192172339127657]
Human video generation aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose. Recent advancements in generative models have laid a solid foundation for the growing interest in this area. Despite the significant progress, the task of human video generation remains challenging due to the consistency of characters, the complexity of human motion, and difficulties in their relationship with the environment.
arXiv Detail & Related papers (2024-07-11T12:09:05Z)
The Lost Melody: Empirical Observations on Text-to-Video Generation From A Storytelling Perspective [4.471962177124311]
We examine text-to-video generation from a storytelling perspective, which has hardly been investigated. We propose an evaluation framework for storytelling aspects of videos and discuss potential future directions.
arXiv Detail & Related papers (2024-05-13T02:25:08Z)
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models [59.54172719450617]
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. This paper presents a review of the model's background, related technologies, applications, remaining challenges, and future directions.
arXiv Detail & Related papers (2024-02-27T03:30:58Z)
A Survey of AI Text-to-Image and AI Text-to-Video Generators [0.4662017507844857]
Text-to-Image and Text-to-Video AI generation models are revolutionary technologies that use deep learning and natural language processing (NLP) techniques to create images and videos from textual descriptions. This paper investigates cutting-edge approaches in the discipline of Text-to-Image and Text-to-Video AI generations.
arXiv Detail & Related papers (2023-11-10T17:33:58Z)
State of the Art on Diffusion Models for Visual Computing [191.6168813012954]
This report introduces the basic mathematical concepts of diffusion models, implementation details and design choices of the popular Stable Diffusion model. We also give a comprehensive overview of the rapidly growing literature on diffusion-based generation and editing. We discuss available datasets, metrics, open challenges, and social implications.
arXiv Detail & Related papers (2023-10-11T05:32:29Z)
Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions. We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods. Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study. In this paper, we tackle the text to video generation problem, which is a conditional form of video generation. We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
Pretrained Language Models for Text Generation: A Survey [46.03096493973206]
We present an overview of the major advances achieved in the topic of pretrained language models (PLMs) for text generation. We discuss how to adapt existing PLMs to model different input data and satisfy special properties in the generated text.
arXiv Detail & Related papers (2021-05-21T12:27:44Z)
A Survey of Knowledge-Enhanced Text Generation [81.24633231919137]
The goal of text generation is to make machines express in human language. Various neural encoder-decoder models have been proposed to achieve the goal by learning to map input text to output text. To address this issue, researchers have considered incorporating various forms of knowledge beyond the input text into the generation models.
arXiv Detail & Related papers (2020-10-09T06:46:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.