UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
- URL: http://arxiv.org/abs/2503.09949v2
- Date: Fri, 21 Mar 2025 13:53:32 GMT
- Title: UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
- Authors: Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang,
- Abstract summary: This work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AI-generated videos (AIGVs)<n> UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects.<n>Our results suggest that while advanced MLLMs still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation.
- Score: 20.199060287444162
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at https://github.com/bytedance/UVE.
Related papers
- VideoGen-Eval: Agent-based System for Video Generation Evaluation [54.662739174367836]
Video generation has rendered existing evaluation systems inadequate for assessing state-of-the-art models.
We propose VideoGen-Eval, an agent evaluation system that integrates content structuring, MLLM-based content judgment, and patch tools for temporal-dense dimensions.
We introduce a video generation benchmark to evaluate existing cutting-edge models and verify the effectiveness of our evaluation system.
arXiv Detail & Related papers (2025-03-30T14:12:21Z) - Human Re-ID Meets LVLMs: What can we expect? [14.370360290704197]
We compare the performance of the leading large vision-language models in the human re-identification task.
Our results confirm the strengths of LVLMs, but also their severe limitations that often lead to catastrophic answers.
arXiv Detail & Related papers (2025-01-30T19:00:40Z) - EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.
It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity.
We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z) - OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.<n>Our benchmark is characterized by its multi-dimensional evaluation framework.<n>Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z) - AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM [54.44479359918971]
We first present AIGVQA-DB, a large-scale dataset comprising 36,576 AIGVs generated by 15 advanced text-to-video models using 1,048 prompts.
We then introduce AIGV-Assessor, a novel VQA model that leverages intricate quality attributes to capture precise video quality scores and pair video preferences.
arXiv Detail & Related papers (2024-11-26T08:43:15Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.<n>Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware [4.480157114854711]
We present FVEval, the first comprehensive benchmark for characterizing large language models (LLMs) performance in tasks pertaining to formal verification (FV)
The benchmark consists of three sub-tasks that measure LLM capabilities at different levels.
We present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with FV.
arXiv Detail & Related papers (2024-10-15T21:48:57Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs)
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained
Evaluation [31.062433484245684]
We train Prometheus-Vision, the first open-source VLM evaluator model that can understand the user-defined score criteria during evaluation.
Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V among open-source models.
arXiv Detail & Related papers (2024-01-12T14:19:23Z) - VLM-Eval: A General Evaluation on Video Large Language Models [16.92780012093112]
We introduce a unified evaluation that encompasses multiple video tasks, including captioning, question and answering, retrieval, and action recognition.
We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs.
We evaluate video LLMs beyond academic datasets, which show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs for fine-tuning.
arXiv Detail & Related papers (2023-11-20T16:02:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.