Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
- URL: http://arxiv.org/abs/2505.08468v1
- Date: Tue, 13 May 2025 11:50:08 GMT
- Title: Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
- Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang,
- Abstract summary: We evaluate 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks.<n>We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy.<n>We focus on cost-effective LVLMs suitable for both research and commercial use.
- Score: 26.909604648952616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge's accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.
Related papers
- Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems [32.83708359216193]
Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems.<n>This paper systematically investigates judgment biases in two LLM-as-a-judge models under the point-wise scoring setting.<n>We propose four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.
arXiv Detail & Related papers (2025-10-14T12:52:29Z) - Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices [43.58820960865236]
Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks.<n>We propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B- parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge.
arXiv Detail & Related papers (2025-10-08T21:02:19Z) - Quantitative LLM Judges [48.676042957523045]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain.<n>The models are trained to improve the score of the original judge by using the judge's textual evaluation and score.<n>Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z) - Chart-to-Experience: Benchmarking Multimodal LLMs for Predicting Experiential Impact of Charts [11.029722116574604]
We introduce Chart-to-Experience, a benchmark dataset comprising 36 charts, evaluated by crowdsourced workers for their impact on seven experiential factors.<n>Using the dataset as ground truth, we evaluated capabilities of state-of-the-art MLLMs on two tasks: direct prediction and pairwise comparison of charts.<n>Our findings imply that MLLMs are not as sensitive as human evaluators when assessing individual charts, but are accurate and reliable in pairwise comparisons.
arXiv Detail & Related papers (2025-05-23T01:12:57Z) - Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation [14.521056434373213]
Large vision-language models (LVLMs) have emerged as the preferred tools for judging text-image alignment.<n>Can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores?<n>This work is the first study to address a key research question: Can adversarial visual manipulations systematically fool LVLM judges into assigning unfairly inflated scores?
arXiv Detail & Related papers (2025-05-21T08:24:28Z) - Assessing Judging Bias in Large Reasoning Models: An Empirical Study [99.86300466350013]
Large Reasoning Models (LRMs) like DeepSeek-R1 and OpenAI-o1 have demonstrated remarkable reasoning capabilities.<n>We present a benchmark comparing judging biases between LLMs and LRMs across both subjective preference-alignment datasets and objective fact-based datasets.
arXiv Detail & Related papers (2025-04-14T07:14:27Z) - Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models [69.68265487134686]
Video SimpleQA is the first comprehensive benchmark tailored for factuality evaluation of LVLMs.<n>Our work distinguishes from existing video benchmarks through the following key features.<n>Answers are crafted as unambiguous and definitively correct in a short format.
arXiv Detail & Related papers (2025-03-24T17:46:09Z) - Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models [68.92020689188887]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs)<n>Existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation.<n>This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models.
arXiv Detail & Related papers (2025-02-26T04:50:43Z) - From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge [32.55871325700294]
Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP)<n>Recent advancements in Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm.
arXiv Detail & Related papers (2024-11-25T17:28:44Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.<n>Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
textbfJudger-1 is the first open-source textbfall-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
textbfJudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments.<n>We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data.<n>We evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations.
arXiv Detail & Related papers (2024-06-26T14:56:13Z) - TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot [2.186726107112913]
We propose a model-based evaluation method: TALEC.
It allows users to flexibly set their own evaluation criteria, and uses in-context learning (ICL) to teach judge model these in-house criteria.
TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments.
arXiv Detail & Related papers (2024-06-25T10:02:42Z) - Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs [30.179703001666173]
Factuality issue is a critical concern for Large Language Models (LLMs)
We propose GraphEval to evaluate an LLM's performance using a substantially large test dataset.
Test dataset is retrieved from a large knowledge graph with more than 10 million facts without expensive human efforts.
arXiv Detail & Related papers (2024-04-01T06:01:17Z) - LVLM-eHub: A Comprehensive Evaluation Benchmark for Large
Vision-Language Models [55.304181390027274]
This paper presents a comprehensive evaluation of publicly available large multimodal models by building a LVLM evaluation Hub (LVLM-eHub)
Our LVLM-eHub consists of $8$ representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform.
The study reveals several innovative findings. First, instruction-tuned LVLM with massive in-domain data such as InstructBLIP heavily overfits many existing tasks, generalizing poorly in the open-world scenario.
arXiv Detail & Related papers (2023-06-15T16:39:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.