Related papers: VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

URL: http://arxiv.org/abs/2505.23693v1
Date: Thu, 29 May 2025 17:31:13 GMT
Title: VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
Authors: Tingyu Song, Tongyan Hu, Guo Gan, Yilun Zhao,
Abstract summary: We propose a new benchmark, VF-Eval, which introduces four tasks-coherence validation, error awareness, error type detection, and reasoning evaluation-to comprehensively evaluate the abilities of MLLMs on AIGC videos.<n>We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks.
Score: 5.529147924182393
License: http://creativecommons.org/licenses/by/4.0/
Abstract: MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks-coherence validation, error awareness, error type detection, and reasoning evaluation-to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.

Related papers

VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking [24.516849841624484]
This paper presents the first systematic investigation of GRPO-based RL post-training for video MLLMs.<n>We develop the VideoCap-R1, which is prompted to first perform structured thinking that analyzes video subjects.<n>Experiments demonstrate that VideoCap-R1 achieves substantial improvements over the Qwen2VL-7B baseline.
arXiv Detail & Related papers (2025-06-02T14:30:09Z)
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning.<n>We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies.<n>We also propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
arXiv Detail & Related papers (2025-05-26T12:05:16Z)
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos? [20.199060287444162]
This work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AI-generated videos (AIGVs)<n> UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects.<n>Our results suggest that while advanced MLLMs still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation.
arXiv Detail & Related papers (2025-03-13T01:52:27Z)
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video large language models (LM) via long and rich context (LRC) modeling.<n>We develop a new version of InternVideo2.5 with focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos.<n> Experimental results demonstrate this unique designML LRC greatly improves the results of video MLLM in mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents [57.4686961979566]
EmbodiedEval is a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks.<n>It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity.<n>We evaluated the state-of-the-art MLLMs on EmbodiedEval and found that they have a significant shortfall compared to human level on embodied tasks.
arXiv Detail & Related papers (2025-01-21T03:22:10Z)
SCBench: A Sports Commentary Benchmark for Video LLMs [19.13963551534595]
We develop a benchmark for sports video commentary generation for Video Large Language Models (Video LLMs)<n>$textbfSCBench$ is a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method.<n>Our results found InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04.
arXiv Detail & Related papers (2024-12-23T15:13:56Z)
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI. We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling. We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z)
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [120.67048724315619]
Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis.<n>We extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models.<n>Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
arXiv Detail & Related papers (2024-05-31T17:59:47Z)
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
VLM-Eval: A General Evaluation on Video Large Language Models [16.92780012093112]
We introduce a unified evaluation that encompasses multiple video tasks, including captioning, question and answering, retrieval, and action recognition. We propose a simple baseline: Video-LLaVA, which uses a single linear projection and outperforms existing video LLMs. We evaluate video LLMs beyond academic datasets, which show encouraging recognition and reasoning capabilities in driving scenarios with only hundreds of video-instruction pairs for fine-tuning.
arXiv Detail & Related papers (2023-11-20T16:02:10Z)
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models [28.305932427801682]
We present ViLMA (Video Language Model Assessment), a task-agnostic benchmark that places the assessment of fine-grained capabilities of VidLMs on a firm footing. ViLMA offers a controlled evaluation suite that sheds light on the true potential of these models, as well as their performance gaps compared to human-level understanding. We show that current VidLMs' grounding abilities are no better than those of vision-language models which use static images.
arXiv Detail & Related papers (2023-11-13T02:13:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.