Related papers: WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs

URL: http://arxiv.org/abs/2403.07944v1
Date: Sun, 10 Mar 2024 16:09:02 GMT
Title: WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs
Authors: Deshun Yang, Luhui Hu, Yu Tian, Zihao Li, Chris Kelly, Bang Yang, Cindy Yang, Yuexian Zou
Abstract summary: We present an innovative video generation AI agent that harnesses Sora-inspired multimodal learning to build a skilled world-model framework. The framework has two parts: a prompt enhancer and full video translation.
Score: 53.21307319844615
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Several text-to-video diffusion models have demonstrated commendable capabilities in synthesizing high-quality video content. However, maintaining temporal consistency and ensuring action smoothness throughout the generated sequences remains a formidable challenge. In this paper, we present an innovative video generation AI agent that harnesses the power of Sora-inspired multimodal learning to build a skilled world-model framework based on textual prompts and accompanying images. The framework includes two parts: a prompt enhancer and full video translation. The first part employs the capabilities of ChatGPT to distill and construct precise prompts for each subsequent step, so that prompts are communicated accurately and the following model operations execute as intended. The second part, which is compatible with existing advanced diffusion techniques, generates and refines the key frame at the conclusion of a video. The leading and trailing key frames are then used to craft videos with enhanced temporal consistency and action smoothness. The experimental results confirm that our method is more effective and novel than other methods in constructing world models from text and image inputs.
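
The two-part flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name here is hypothetical, the prompt enhancer is a string-rewriting stub standing in for the ChatGPT call, the end-frame generator is a stub standing in for a diffusion model, and linear blending stands in for key-frame-conditioned video generation. Frames are modeled as flat lists of pixel values in [0, 1].

```python
from dataclasses import dataclass
from typing import List

Frame = List[float]  # toy stand-in for an image: a flat list of pixel values


@dataclass
class StepPrompts:
    """Prompts produced by the prompt-enhancer stage (hypothetical structure)."""
    end_frame_prompt: str
    video_prompt: str


def enhance_prompt(user_prompt: str) -> StepPrompts:
    # Stub for the LLM-based prompt enhancer; a real system would query a
    # chat model here. This version only normalizes whitespace and adds
    # a per-step prefix.
    cleaned = " ".join(user_prompt.split())
    return StepPrompts(
        end_frame_prompt=f"final frame of: {cleaned}",
        video_prompt=f"smooth motion for: {cleaned}",
    )


def generate_end_frame(first_frame: Frame, prompt: str) -> Frame:
    # Stub for a diffusion model producing the trailing key frame.
    # Here: a deterministic brightness shift derived from the prompt length.
    shift = (len(prompt) % 10) / 10.0
    return [min(1.0, p + shift) for p in first_frame]


def frames_between(first: Frame, last: Frame, num_frames: int) -> List[Frame]:
    # Stub for video generation conditioned on both key frames:
    # plain linear interpolation from the first frame to the last.
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    video = []
    for i in range(num_frames):
        t = i / (num_frames - 1)
        video.append([(1 - t) * a + t * b for a, b in zip(first, last)])
    return video


def generate_video(user_prompt: str, first_frame: Frame, num_frames: int = 8) -> List[Frame]:
    # Stage 1: enhance the prompt. Stage 2: build the trailing key frame,
    # then fill in the frames between the leading and trailing key frames.
    prompts = enhance_prompt(user_prompt)
    last_frame = generate_end_frame(first_frame, prompts.end_frame_prompt)
    return frames_between(first_frame, last_frame, num_frames)
```

Because the output is anchored at both the input image and the generated end frame, every intermediate frame is constrained from both sides, which is the intuition behind the temporal-consistency claim.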

Related papers

T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models [12.120541052871486]
T2VTextBench is the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models. We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text.
arXiv Detail & Related papers (2025-05-08T04:49:52Z)
SkyReels-V2: Infinite-length Film Generative Model [35.00453687783287]
We propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. We establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement.
arXiv Detail & Related papers (2025-04-17T16:37:27Z)
Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions. We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms. Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling. It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences. It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8 to 16 videos on a single GPU. Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation. To capture the features of the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment. Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules. It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure. For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance [36.26032505627126]
Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. In this paper, we explore customized video generation by utilizing text as context description and motion structure. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model.
arXiv Detail & Related papers (2023-06-01T17:43:27Z)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps. We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process. Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods. Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video. The step-by-step learning process helps stabilize the training and enables the creation of high-resolution video based on conditional text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.