Related papers: Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

URL: http://arxiv.org/abs/2412.16211v1
Date: Tue, 17 Dec 2024 23:00:42 GMT
Title: Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
Authors: Yiping Wang, Xuehai He, Kuan Wang, Luyao Ma, Jianwei Yang, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Abstract summary: We introduce StoryEval, a story-oriented benchmark to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2-4 consecutive events. We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos.
Score: 71.32108638269517
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short simple story 'how to put an elephant into a refrigerator.' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2-4 consecutive events. We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models reveals how challenging the benchmark is, with none exceeding an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation.
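
The abstract mentions unanimous voting across vision-language verifiers and an average story-completion rate, but does not spell out the aggregation. The Python sketch below is one plausible reading, not the authors' implementation. It assumes that an event counts as completed only when every verifier marks it completed, that a story's completion rate is the fraction of its events completed, and that the benchmark score is the mean over stories. All function names and the verifier labels are hypothetical.

```python
from typing import Dict, List


def unanimous_event_votes(votes_per_verifier: Dict[str, List[bool]]) -> List[bool]:
    """Combine per-event verdicts: an event passes only if all verifiers agree.

    Assumption: every verifier returns one boolean per event, in the same order.
    """
    verdicts = list(votes_per_verifier.values())
    if not verdicts:
        raise ValueError("at least one verifier is required")
    n_events = len(verdicts[0])
    if any(len(v) != n_events for v in verdicts):
        raise ValueError("all verifiers must judge the same number of events")
    return [all(v[i] for v in verdicts) for i in range(n_events)]


def story_completion_rate(completed: List[bool]) -> float:
    """Fraction of a story's events judged completed (assumed definition)."""
    if not completed:
        raise ValueError("a story needs at least one event")
    return sum(completed) / len(completed)


def average_completion_rate(stories: List[Dict[str, List[bool]]]) -> float:
    """Mean story-completion rate over all stories (assumed aggregation)."""
    if not stories:
        raise ValueError("at least one story is required")
    rates = [story_completion_rate(unanimous_event_votes(s)) for s in stories]
    return sum(rates) / len(rates)


# Hypothetical verdicts for one three-event story from two verifiers.
example_story = {
    "verifier_a": [True, True, False],
    "verifier_b": [True, False, False],
}
example_completed = unanimous_event_votes(example_story)
example_rate = story_completion_rate(example_completed)
```

Under these assumptions, only the first event in the example passes, because the two verifiers disagree on the second and both reject the third. Requiring unanimity trades recall for precision: a single lenient verifier cannot inflate the score.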

Related papers

OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory [47.073128448877775]
We propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. OneStory achieves state-of-the-art narrative coherence across diverse and complex scenes in both text- and image-conditioned settings.
arXiv Detail & Related papers (2025-12-08T18:32:24Z)
SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models [9.237220559112837]
We present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. We use a dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Our DTG-based metric demonstrates a strong correlation with human annotations.
arXiv Detail & Related papers (2025-10-14T23:40:57Z)
T2VWorldBench: A Benchmark for Evaluating World Knowledge in Text-to-Video Generation [12.843117062583502]
We propose T2VWorldBench, the first systematic evaluation framework for evaluating the world knowledge generation abilities of text-to-video models. To address both human preference and scalable evaluation, our benchmark incorporates both human evaluation and automated evaluation using vision-language models (VLMs). We evaluated the 10 most advanced text-to-video models currently available, ranging from open source to commercial models, and found that most models are unable to understand world knowledge and generate truly correct videos.
arXiv Detail & Related papers (2025-07-24T05:37:08Z)
NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation Models [0.17413461132662073]
The goal of long video generation tasks is not only to extend video duration but also to accurately express richer narrative content within longer videos. Our proposed NarrLV is the first benchmark to comprehensively evaluate the Narrative expression capabilities of Long Video generation models.
arXiv Detail & Related papers (2025-07-15T12:19:18Z)
Towards Event-oriented Long Video Understanding [101.48089908037888]
Event-Bench is an event-oriented long video understanding benchmark built on existing datasets and human annotations. VIM is a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions.
arXiv Detail & Related papers (2024-06-20T09:14:19Z)
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge the large conditional generative models from the simple metrics since these models are often trained on very large datasets with multi-aspect abilities. Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment with 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
StoryBench: A Multifaceted Benchmark for Continuous Story Visualization [42.439670922813434]
We introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate text-to-video models. Our benchmark includes three video generation tasks of increasing difficulty: action execution, story continuation, and story generation. We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions.
arXiv Detail & Related papers (2023-08-22T17:53:55Z)
The Next Chapter: A Study of Large Language Models in Storytelling [51.338324023617034]
The application of prompt-based learning with large language models (LLMs) has exhibited remarkable performance in diverse natural language processing (NLP) tasks. This paper conducts a comprehensive investigation, utilizing both automatic and human evaluation, to compare the story generation capacity of LLMs with recent models. The results demonstrate that LLMs generate stories of significantly higher quality compared to other story generation models.
arXiv Detail & Related papers (2023-01-24T02:44:02Z)
NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory [92.98552727430483]
Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. NaQ improves multiple top models by substantial margins (even doubling their accuracy). We also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
arXiv Detail & Related papers (2023-01-02T16:40:15Z)
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.