PairBench: A Systematic Framework for Selecting Reliable Judge VLMs
- URL: http://arxiv.org/abs/2502.15210v2
- Date: Mon, 24 Feb 2025 15:01:43 GMT
- Title: PairBench: A Systematic Framework for Selecting Reliable Judge VLMs
- Authors: Aarash Feizi, Sai Rajeswar, Adriana Romero-Soriano, Reihaneh Rabbany, Spandana Gella, Valentina Zantedeschi, João Monteiro
- Abstract summary: We present PairBench, a framework that systematically evaluates large vision language models (VLMs) as customizable similarity tools. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics.
- Score: 16.49586486795478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator's desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.
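To make these desiderata concrete, here is a minimal Python sketch of how one might probe a judge VLM for two of the four metrics, order consistency and alignment with human annotations. This is an illustrative assumption, not the authors' implementation: `query_vlm_similarity` is a hypothetical stand-in for whichever judge model is under evaluation, and the exact metric formulas used in PairBench may differ.

```python
# Illustrative sketch only (not the PairBench implementation).
# `query_vlm_similarity` is a hypothetical wrapper around the judge VLM being
# evaluated; it is assumed to return a similarity score in [0, 1].
from scipy.stats import spearmanr

def query_vlm_similarity(judge, item_a, item_b) -> float:
    """Ask the judge VLM to rate the similarity of a data pair."""
    raise NotImplementedError("plug in the judge model under evaluation")

def evaluate_judge(judge, pairs, human_scores):
    """pairs: list of (item_a, item_b); human_scores: human similarity labels."""
    forward, backward = [], []
    for a, b in pairs:
        forward.append(query_vlm_similarity(judge, a, b))
        backward.append(query_vlm_similarity(judge, b, a))  # same pair, swapped order

    # Order consistency: scores should be (nearly) symmetric under pair reordering.
    consistency = 1.0 - sum(abs(f - r) for f, r in zip(forward, backward)) / len(pairs)

    # Alignment: rank correlation between judge scores and human annotations.
    alignment, _ = spearmanr(forward, human_scores)
    return {"consistency": consistency, "alignment": alignment}
```

Smoothness and controllability could be probed analogously: the former by examining whether the score distribution spreads over many pairs rather than collapsing onto a few values, the latter by re-running the same pairs under different prompt instructions and checking that the scores shift as requested.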
Related papers
- Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - A Statistical Framework for Ranking LLM-Based Chatbots [57.59268154690763]
We propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle groupings of human-judged comparisons. Second, we extend the framework to model covariance tiers between competitors, enabling deeper insights into performance relationships. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints.
arXiv Detail & Related papers (2024-12-24T12:54:19Z) - Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels [64.94853276821992]
Large multimodal models (LMMs) are increasingly deployed across diverse applications. Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics. We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Beyond Benchmarks: Evaluating Embedding Model Similarity for Retrieval Augmented Generation Systems [0.9976432338233169]
We evaluate the similarity of embedding models within the context of RAG systems.
We compare different families of embedding models, including proprietary ones, across five datasets.
We identify possible open-source alternatives to proprietary models, with Mistral exhibiting the highest similarity to OpenAI models.
arXiv Detail & Related papers (2024-07-11T08:24:16Z) - Compare without Despair: Reliable Preference Evaluation with Generation Separability [20.50638483427141]
We introduce a measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation.
For a candidate test instance, separability samples multiple generations from a pair of models, and measures how distinguishable the two sets of generations are.
Experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters.
arXiv Detail & Related papers (2024-07-02T01:37:56Z) - CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model Recommender System [16.84754752395103]
This work takes a critical stance on previous studies concerning fairness evaluation in Large Language Model (LLM)-based recommender systems.
We introduce CFaiRLLM, an enhanced evaluation framework that not only incorporates true preference alignment but also rigorously examines intersectional fairness.
To validate the efficacy of CFaiRLLM, we conducted extensive experiments using MovieLens and LastFM.
arXiv Detail & Related papers (2024-03-08T20:44:59Z) - Towards Open-ended Visual Quality Comparison [87.45004129101089]
We extend the edge of emerging large multi-modality models (LMMs) to advance visual quality comparison into open-ended settings.
Co-Instruct is a first-of-its-kind open-source open-ended visual quality comparer.
We demonstrate that Co-Instruct achieves on average 30% higher accuracy than state-of-the-art open-source LMMs.
arXiv Detail & Related papers (2024-02-26T15:10:56Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)