AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
- URL: http://arxiv.org/abs/2311.14906v2
- Date: Mon, 15 Jul 2024 16:42:22 GMT
- Title: AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering
- Authors: Xiuyuan Chen, Yuan Lin, Yuchen Zhang, Weiran Huang
- Abstract summary: We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering.
The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation; 2) it contains newly collected videos covering over 40 distinct themes.
By using instance-specific rules as the prompt, GPT-4, acting as an automatic evaluator, achieves a stable evaluation accuracy of around 97.0%, comparable to the 94.9%-97.5% accuracy of a human evaluator.
- Score: 6.088350050879401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation. 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. By using instance-specific rules as the prompt, GPT-4, as an automatic evaluator, can achieve a stable evaluation accuracy of around 97.0%, comparable to the 94.9%-97.5% accuracy of a human evaluator. Furthermore, we assess the performance of eight large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to human accuracy of 72.8%. By conducting an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension, and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.
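To make the evaluation protocol concrete, here is a minimal sketch of prompting GPT-4 as an automatic judge with an instance-specific rule, assuming the official openai Python client; the prompt template and variable names are illustrative, not the paper's exact ones (those are in the linked repository).
```python
# Minimal sketch of LLM-as-judge evaluation with an instance-specific rule.
# The prompt wording below is an illustrative assumption, not AutoEval-Video's
# exact template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, rule: str, response: str) -> bool:
    """Ask GPT-4 whether `response` satisfies the instance-specific rule."""
    prompt = (
        f"Question about the video: {question}\n"
        f"Evaluation rule for a correct answer: {rule}\n"
        f"Model response: {response}\n"
        "Does the response satisfy the rule? Reply with exactly "
        "'correct' or 'incorrect'."
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic judging for stable accuracy
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content.strip().lower() == "correct"
```
Setting the temperature to 0 keeps the judgment deterministic, which is one plausible way to obtain the stable evaluation accuracy the paper reports.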
Related papers
- Empowering Agentic Video Analytics Systems with Video Language Models [16.962065998310223]
Video-Language Models (VLMs) offer potential for enabling open-ended video understanding, reasoning, and analytics.
VLMs' limited context windows present challenges when processing ultra-long video content.
We introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics.
arXiv Detail & Related papers (2025-05-01T02:40:23Z)
- SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning [78.44705665291741]
We present a comprehensive evaluation of modern video self-supervised models.
We focus on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity.
Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions.
arXiv Detail & Related papers (2025-04-08T06:00:28Z)
- VideoSAVi: Self-Aligned Video Language Models without Human Supervision [0.6854849895338531]
VideoSAVi is a self-training pipeline that enables Video-LLMs to reason over video content without external supervision.
VideoSAVi achieves state-of-the-art performance on MVBench (74.0%) and delivers significant improvements.
Our model-agnostic approach is computationally efficient, requiring only 32 frames.
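As background for the 32-frame budget, here is a minimal sketch of uniform frame sampling with OpenCV; the sampling scheme is a generic assumption, not VideoSAVi's actual preprocessing:
```python
# Uniformly sample N frames from a video with OpenCV -- a generic sketch of
# the sparse-sampling step implied by a 32-frame budget, not VideoSAVi's code.
import cv2
import numpy as np

def sample_frames(path: str, num_frames: int = 32) -> list[np.ndarray]:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```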
arXiv Detail & Related papers (2024-12-01T00:33:05Z)
- VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.5892290894904]
VBench is a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions.
We provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception.
VBench++ supports evaluating both text-to-video and image-to-video generation.
arXiv Detail & Related papers (2024-11-20T17:54:41Z)
- VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation [27.814374756146698]
VideoAutoArena is an arena-style benchmark designed to automatically assess LMMs' video analysis abilities.
VideoAutoArena utilizes user simulation to generate open-ended, adaptive questions that rigorously assess model performance in video understanding.
We introduce a fault-driven evolution strategy, progressively increasing question complexity to push models toward handling more challenging video analysis scenarios.
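Arena-style benchmarks typically aggregate pairwise battles into a rating; the sketch below implements a standard Elo update (VideoAutoArena reports a modified ELO system, so the exact variant and K-factor here are assumptions):
```python
# Standard Elo update over pairwise model battles -- a generic sketch of
# arena-style scoring; VideoAutoArena uses a modified ELO variant, and the
# K-factor below is an assumption.
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: two models start at 1200; model A wins one battle.
ratings = {"model_a": 1200.0, "model_b": 1200.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_wins=True
)
```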
arXiv Detail & Related papers (2024-11-20T12:48:34Z)
- EVQAScore: Efficient Video Question Answering Data Evaluation [23.812020049901452]
We introduce EVQAScore, a reference-free method that leverages keyword extraction to assess both video caption and video QA data quality.
Our approach achieves state-of-the-art (SOTA) performance on video caption evaluation (32.8 Kendall correlation and 42.3 Spearman correlation, 4.7 and 5.9 points higher than the previous method PAC-S++).
By using EVQAScore for data selection, we achieve SOTA results with only 12.5% of the original data volume, outperforming the previous SOTA method PAC-S using 100% of the data.
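For reference, the reported Kendall and Spearman numbers measure rank agreement between metric scores and human judgments; here is a minimal sketch with SciPy, using made-up scores:
```python
# Rank-correlation check of an automatic metric against human judgments,
# as used to report Kendall/Spearman numbers; the scores here are made up.
from scipy.stats import kendalltau, spearmanr

human_scores  = [4, 2, 5, 1, 3]             # human quality ratings per caption
metric_scores = [0.8, 0.3, 0.9, 0.2, 0.5]   # automatic metric per caption

tau, _ = kendalltau(human_scores, metric_scores)
rho, _ = spearmanr(human_scores, metric_scores)
print(f"Kendall tau: {tau:.3f}, Spearman rho: {rho:.3f}")
```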
arXiv Detail & Related papers (2024-11-11T12:11:36Z)
- Benchmarking AIGC Video Quality Assessment: A Dataset and Unified Model [54.69882562863726]
We systematically investigate the AIGC-VQA problem from both subjective and objective quality assessment perspectives.
We evaluate the perceptual quality of AIGC videos from three dimensions: spatial quality, temporal quality, and text-to-video alignment.
We propose a Unified Generated Video Quality assessment (UGVQ) model to comprehensively and accurately evaluate the quality of AIGC videos.
arXiv Detail & Related papers (2024-07-31T07:54:26Z)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding.
MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases.
The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
- LingoQA: Visual Question Answering for Autonomous Driving [14.620546951115328]
We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving.
The dataset contains 28K unique short video scenarios, and 419K annotations.
On our benchmark, vision-language models respond truthfully to 59.6% of the questions compared to 96.6% for humans.
arXiv Detail & Related papers (2023-12-21T18:40:34Z)
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models [81.84810348214113]
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
To guide the development of such a model, the establishment of a robust and comprehensive evaluation system becomes crucial.
This paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs.
arXiv Detail & Related papers (2023-11-27T18:59:58Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that large conditional generative models are hard to judge with simple metrics, since they are often trained on very large datasets and exhibit multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-video alignment, using 17 well-selected objective metrics.
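One common way to combine many objective metrics into a single benchmark score is to fit per-metric weights against human ratings; the least-squares sketch below illustrates the idea with made-up data and is not EvalCrafter's exact procedure:
```python
# Fit per-metric weights to human opinion scores by least squares -- a
# generic sketch of aligning objective metrics with human ratings; the
# metrics and numbers are illustrative, not EvalCrafter's exact method.
import numpy as np

# Rows: generated videos; columns: normalized objective metrics
# (e.g. visual quality, motion quality, text-video alignment).
metric_scores = np.array([
    [0.7, 0.6, 0.8],
    [0.4, 0.5, 0.3],
    [0.9, 0.8, 0.7],
    [0.2, 0.3, 0.4],
])
human_scores = np.array([0.75, 0.40, 0.85, 0.30])

weights, *_ = np.linalg.lstsq(metric_scores, human_scores, rcond=None)
overall = metric_scores @ weights  # weighted benchmark score per video
print("weights:", weights, "overall:", overall)
```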
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- Perception Test: A Diagnostic Benchmark for Multimodal Video Models [78.64546291816117]
We propose a novel multimodal video benchmark to evaluate the perception and reasoning skills of pre-trained multimodal models.
The Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities.
The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime.
arXiv Detail & Related papers (2023-05-23T07:54:37Z)