VideoGen-Eval: Agent-based System for Video Generation Evaluation
- URL: http://arxiv.org/abs/2503.23452v2
- Date: Sun, 27 Apr 2025 02:59:54 GMT
- Title: VideoGen-Eval: Agent-based System for Video Generation Evaluation
- Authors: Yuhang Yang, Ke Fan, Shangkun Sun, Hongxiang Li, Ailing Zeng, FeiLin Han, Wei Zhai, Wei Liu, Yang Cao, Zheng-Jun Zha
- Abstract summary: Video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models. We propose VideoGen-Eval, an agent evaluation system that integrates content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions. We introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system.
- Score: 54.662739174367836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models, primarily due to simple prompts that cannot showcase a model's capabilities, fixed evaluation operators that struggle with out-of-distribution (OOD) cases, and misalignment between computed metrics and human preferences. To bridge this gap, we propose VideoGen-Eval, an agent evaluation system that integrates LLM-based content structuring, MLLM-based content judgment, and patch tools designed for temporal-dense dimensions, to achieve dynamic, flexible, and expandable video generation evaluation. Additionally, we introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system. It comprises 700 structured, content-rich prompts (both T2V and I2V) and over 12,000 videos generated by more than 20 models; among these, 8 cutting-edge models are selected for quantitative evaluation by both the agent and human raters. Extensive experiments validate that our agent-based evaluation system aligns strongly with human preferences and reliably completes evaluations, and they confirm the diversity and richness of the benchmark.
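As a rough illustration of the three-stage agent design the abstract describes (LLM-based content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions), a minimal Python sketch follows. Every name and function body here is a hypothetical stub standing in for real LLM/MLLM calls; it is not the authors' implementation.

```python
# Minimal sketch of a three-stage agent evaluation flow in the spirit of
# VideoGen-Eval. All names and function bodies are hypothetical stubs.
from dataclasses import dataclass, field


@dataclass
class EvalRequest:
    prompt: str        # structured, content-rich generation prompt
    video_path: str    # generated video under evaluation
    dimensions: list = field(default_factory=lambda: ["subject", "motion", "alignment"])


def structure_content(request: EvalRequest) -> dict:
    """Stage 1 (LLM-based): split the prompt into checkable sub-claims.

    A real system would call an LLM; this stub just segments on commas.
    """
    claims = [c.strip() for c in request.prompt.split(",") if c.strip()]
    return {dim: claims for dim in request.dimensions}


def judge_content(video_path: str, structured: dict) -> dict:
    """Stage 2 (MLLM-based): score each sub-claim against the video.

    Stubbed with a constant score; a real system would query an MLLM
    judge on sampled frames.
    """
    return {dim: {claim: 0.5 for claim in claims}
            for dim, claims in structured.items()}


def patch_temporal_dense(video_path: str, scores: dict) -> dict:
    """Stage 3: patch tools cover temporal-dense dimensions (e.g. motion
    smoothness) that frame-sampling MLLM judges tend to miss."""
    scores.setdefault("motion", {})["temporal_consistency_stub"] = 0.5
    return scores


def evaluate(request: EvalRequest) -> dict:
    structured = structure_content(request)
    scores = judge_content(request.video_path, structured)
    return patch_temporal_dense(request.video_path, scores)


if __name__ == "__main__":
    req = EvalRequest(prompt="a red fox runs through snow, camera pans left",
                      video_path="sample.mp4")
    print(evaluate(req))
```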
Related papers
- Video-Bench: Human-Aligned Video Generation Benchmark [26.31594706735867]
Video generation assessment is essential for ensuring that generative models produce visually realistic, high-quality videos.
This paper introduces Video-Bench, a comprehensive benchmark featuring a rich prompt suite and extensive evaluation dimensions.
Experiments on advanced models including Sora demonstrate that Video-Bench achieves superior alignment with human preferences across all dimensions.
arXiv Detail & Related papers (2025-04-07T10:32:42Z)
- AIGVE-Tool: AI-Generated Video Evaluation Toolkit with Multifaceted Benchmark [8.827755848017578]
Existing metrics lack a unified framework for systematically categorizing methodologies. We introduce AIGVE-Tool, a unified framework that provides a structured taxonomy and evaluation pipeline for AI-generated video evaluation. A large-scale benchmark dataset is created with five SOTA video generation models, based on hand-crafted instructions and prompts.
arXiv Detail & Related papers (2025-03-18T09:36:33Z)
- Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models [51.067146460271466]
Evaluation of visual generative models can be time-consuming and computationally expensive. We propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools.
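A multi-round, promptable loop of the kind this summary describes might look roughly as follows; the dimension names, stub scorers, and selection heuristic are illustrative assumptions, not the framework's actual components.

```python
import random

# Stub scorers; a real agent would invoke actual metric tools or an MLLM judge.
TOOLS = {
    "aesthetics": lambda model_id: random.uniform(0.0, 1.0),
    "motion":     lambda model_id: random.uniform(0.0, 1.0),
    "alignment":  lambda model_id: random.uniform(0.0, 1.0),
}


def run_rounds(model_id: str, user_query: str, max_rounds: int = 4) -> dict:
    """Probe dimensions over several rounds instead of running every test.

    Dimensions named in the user's question are probed first; remaining
    rounds go to the least-explored dimension. This heuristic is a crude
    stand-in for the LLM planner in the real framework.
    """
    observations = {dim: [] for dim in TOOLS}
    queue = [d for d in TOOLS if d in user_query.lower()]
    for _ in range(max_rounds):
        dim = queue.pop(0) if queue else min(observations,
                                             key=lambda d: len(observations[d]))
        observations[dim].append(TOOLS[dim](model_id))
    # Report per-dimension means plus raw traces, so the result is
    # explainable rather than a single opaque number.
    return {d: {"mean": sum(v) / len(v), "trace": v}
            for d, v in observations.items() if v}


print(run_rounds("demo-model", "How well does it handle motion?"))
```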
arXiv Detail & Related papers (2024-12-10T18:52:39Z)
- AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM [54.44479359918971]
We first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 prompts.
We then introduce AIGV-Assessor, a novel VQA model that leverages intricate quality attributes to capture precise video quality scores and pairwise video preferences.
arXiv Detail & Related papers (2024-11-26T08:43:15Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-video alignment, using 17 well-selected objective metrics.
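A common recipe for aligning many objective metrics with human opinion, which this sketch illustrates on synthetic data (it is not EvalCrafter's released code), is to fit per-metric weights against human ratings by least squares so the aggregate score tracks human preference.

```python
# Illustrative sketch on synthetic data: fit a linear combination of
# objective metric scores to human opinion scores by least squares.
import numpy as np

rng = np.random.default_rng(0)
n_videos, n_metrics = 50, 17        # e.g. 17 objective metrics per video
metric_scores = rng.uniform(0, 1, size=(n_videos, n_metrics))
true_weights = rng.uniform(0, 1, size=n_metrics)
human_scores = metric_scores @ true_weights + rng.normal(0, 0.05, n_videos)

# Least-squares fit of per-metric weights (plus a bias term) to ratings.
X = np.hstack([metric_scores, np.ones((n_videos, 1))])
weights, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

predicted = X @ weights
corr = np.corrcoef(predicted, human_scores)[0, 1]
print(f"alignment with human scores (Pearson r): {corr:.3f}")
```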
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension [27.53415400454066]
We introduce a benchmark named SEED-Bench to assess generative models.
SEED-Bench consists of 19K multiple choice questions with accurate human annotations.
We evaluate the performance of 18 models across all 12 dimensions, covering both the spatial and temporal understanding.
arXiv Detail & Related papers (2023-07-30T04:25:16Z)
- Study on the Assessment of the Quality of Experience of Streaming Video [117.44028458220427]
In this paper, the influence of various objective factors on the subjective estimation of the QoE of streaming video is studied.
The paper presents standard and handcrafted features and shows their correlation and significance p-values.
We use the SQoE-III database, so far the largest and most realistic of its kind.
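The feature-screening step described here (per-feature correlation with subjective scores plus a significance p-value) can be sketched as below; the feature names and data are synthetic placeholders, not the SQoE-III database.

```python
# Hedged sketch: correlate candidate QoE features with mean opinion
# scores (MOS) and report significance p-values. Data is synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sessions = 40
features = {
    "startup_delay_s":  rng.uniform(0.1, 5.0, n_sessions),  # hypothetical
    "rebuffer_ratio":   rng.uniform(0.0, 0.3, n_sessions),  # feature names
    "avg_bitrate_mbps": rng.uniform(0.5, 8.0, n_sessions),
}
# Synthetic MOS loosely driven by the features above, plus noise.
mos = (4.5 - 0.3 * features["startup_delay_s"]
           - 6.0 * features["rebuffer_ratio"]
           + 0.2 * features["avg_bitrate_mbps"]
           + rng.normal(0, 0.2, n_sessions))

for name, values in features.items():
    r, p = stats.pearsonr(values, mos)  # correlation and its p-value
    print(f"{name:>16}: r = {r:+.2f}, p = {p:.3g}")
```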
arXiv Detail & Related papers (2020-12-08T18:46:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.