Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models
- URL: http://arxiv.org/abs/2504.09979v1
- Date: Mon, 14 Apr 2025 08:43:00 GMT
- Title: Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models
- Authors: Teppei Suzuki, Keisuke Ozawa
- Abstract summary: We propose an efficient evaluation protocol for large vision-language models (VLMs). We construct a subset that yields results comparable to full benchmark evaluations. Applying FPS to an existing benchmark improves correlation with overall evaluation results.
- Score: 18.309464845180237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an efficient evaluation protocol for large vision-language models (VLMs). Given their broad knowledge and reasoning capabilities, multiple benchmarks are needed for comprehensive assessment, making evaluation computationally expensive. To improve efficiency, we construct a subset that yields results comparable to full benchmark evaluations. Our benchmark classification experiments reveal that no single benchmark fully covers all challenges. We then introduce a subset construction method using farthest point sampling (FPS). Our experiments show that FPS-based benchmarks maintain a strong correlation (> 0.96) with full evaluations while using only ~1% of the data. Additionally, applying FPS to an existing benchmark improves correlation with overall evaluation results, suggesting its potential to reduce unintended dataset biases.
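To make the core idea concrete, below is a minimal sketch of farthest point sampling used to pick a small, representative benchmark subset and of checking how well that subset preserves model rankings. It is not the authors' implementation: the use of per-item embeddings, Euclidean distance, Spearman rank correlation, and the names `select_subset_fps`, `item_embeddings`, and `model_scores` are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def select_subset_fps(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Greedy farthest point sampling: pick k item indices so that each new
    item maximizes its distance to the already-selected set."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary starting item
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))  # item farthest from the current subset
        selected.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)  # track distance to nearest selected item
    return np.array(selected)

# Toy usage (synthetic data): 10,000 benchmark items with 512-d embeddings, keep ~1%.
item_embeddings = np.random.randn(10_000, 512)
subset_idx = select_subset_fps(item_embeddings, k=100)

# Per-item scores for several models (rows: models, columns: items).
model_scores = np.random.rand(5, 10_000)
full_acc = model_scores.mean(axis=1)
subset_acc = model_scores[:, subset_idx].mean(axis=1)

# How well does the subset preserve the full-benchmark model ranking?
rho, _ = spearmanr(full_acc, subset_acc)
print(f"rank correlation (subset vs. full): {rho:.3f}")
```

The design intuition is that FPS greedily maximizes coverage of the embedding space, so even a very small subset spans the diversity of the full benchmark, which is what allows it to track full-evaluation results closely.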
Related papers
- Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness [0.4749981032986242]
This study introduces a novel evaluation methodology that combines Item Response Theory (IRT) with the Glicko-2 rating system.
IRT assesses classifier ability based on performance over difficult instances, while Glicko-2 updates performance metrics.
A case study using the OpenML-CC18 benchmark showed that only 15% of the datasets are truly challenging.
arXiv Detail & Related papers (2025-04-13T23:54:08Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models [65.8478860180793]
Event extraction has gained extensive research attention due to its broad range of applications.
Current evaluation methods for event extraction rely on token-level exact match.
We propose a reliable and semantic evaluation framework for event extraction, named RAEE.
arXiv Detail & Related papers (2024-10-12T07:54:01Z) - Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts [0.6282171844772422]
Training data for many Large Language Models (LLMs) is contaminated with test data.
Public benchmark scores do not always accurately assess model properties.
arXiv Detail & Related papers (2024-10-11T20:46:56Z) - Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling [3.7467864495337624]
SubLIME is a data-efficient evaluation framework for text-to-image models.
Our approach ensures statistically aligned model rankings compared to full datasets.
We leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks.
arXiv Detail & Related papers (2024-06-21T07:38:55Z) - A structured regression approach for evaluating model performance across intersectional subgroups [53.91682617836498]
Disaggregated evaluation is a central task in AI fairness assessment, where the goal is to measure an AI system's performance across different subgroups.
We introduce a structured regression approach to disaggregated evaluation that we demonstrate can yield reliable system performance estimates even for very small subgroups.
arXiv Detail & Related papers (2024-01-26T14:21:45Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm, that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability.
arXiv Detail & Related papers (2023-08-22T17:59:30Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show that GREAT Score achieves high correlation with, at significantly reduced cost compared to, attack-based model rankings on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)