Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
- URL: http://arxiv.org/abs/2412.16211v1
- Date: Tue, 17 Dec 2024 23:00:42 GMT
- Title: Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
- Authors: Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, Yelong Shen,
- Abstract summary: We introduce StoryEval, a story-oriented benchmark to assess text-to-video (T2V) models' story-completion capabilities.
StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2-4 consecutive events.
We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos.
- Score: 71.32108638269517
- License:
- Abstract: The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeable an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short simple story 'how to put an elephant into a refrigerator.' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2-4 consecutive events. We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models reveals its challenge, with none exceeding an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation.
Related papers
- Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations.
VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z) - StoryBench: A Multifaceted Benchmark for Continuous Story Visualization [42.439670922813434]
We introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate text-to-video models.
Our benchmark includes three video generation tasks of increasing difficulty: action execution, story continuation, and story generation.
We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions.
arXiv Detail & Related papers (2023-08-22T17:53:55Z) - NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory [92.98552727430483]
Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model.
NaQ improves multiple top models by substantial margins (even doubling their accuracy)
We also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
arXiv Detail & Related papers (2023-01-02T16:40:15Z) - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.