Related papers: EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts

EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts

URL: http://arxiv.org/abs/2508.04650v1
Date: Wed, 06 Aug 2025 17:17:46 GMT
Title: EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts
Authors: Kushin Mukherjee, Donghao Ren, Dominik Moritz, Yannick Assogba,
Abstract summary: Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks.<n>We introduce EncQA, a novel benchmark designed to provide systematic coverage of visual encodings and analytic tasks.<n>Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks.
Score: 13.788477482875855
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.

Related papers

ChartAB: A Benchmark for Chart Grounding & Dense Alignment [17.16234793106]
We introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of vision-language models (VLMs)<n>By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs capability to align and compare elements/attributes across two charts.<n>Our analysis of evaluations reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding.
arXiv Detail & Related papers (2025-10-30T17:56:31Z)
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding [14.75820681491341]
Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning.<n>We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations.<n>Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
arXiv Detail & Related papers (2025-04-14T00:07:39Z)
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding.<n>Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
Towards Understanding Graphical Perception in Large Multimodal Models [80.44471730672801]
We leverage the theory of graphical perception to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts.<n>We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three levels (chart, visual element, and pixel)
arXiv Detail & Related papers (2025-03-13T20:13:39Z)
Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks.<n>They often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison.<n>We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z)
HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing [54.970275599061594]
We design an adaptive evaluation framework, called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE)<n>HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs.<n>To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
arXiv Detail & Related papers (2024-12-07T15:47:49Z)
VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models [1.597617022056624]
Large Vision-Language Models (LVLMs) are increasingly capable of tackling abstract visual tasks. We introduce VisGraphVar, a customizable benchmark generator able to produce graph images for seven task categories. We show that variations in visual attributes of images (e.g., node labeling and layout) and the deliberate inclusion of visual imperfections significantly affect model performance.
arXiv Detail & Related papers (2024-11-22T10:10:53Z)
RealCQA-V2 : Visual Premise Proving A Manual COT Dataset for Charts [2.9201864249313383]
We introduce Visual Premise Proving, a novel task tailored to refine the process of chart question answering. This approach represents a departure from conventional accuracy-based evaluation methods. A model adept at reasoning is expected to demonstrate proficiency in both data retrieval and the structural understanding of charts.
arXiv Detail & Related papers (2024-10-29T19:32:53Z)
Distill Visual Chart Reasoning Ability from LLMs to MLLMs [64.32993770646165]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs)<n>We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.<n>ReachQA is a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
Charting the Future: Using Chart Question-Answering for Scalable Evaluation of LLM-Driven Data Visualizations [7.32619928577074]
We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Our results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures.
arXiv Detail & Related papers (2024-09-27T14:02:48Z)
On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.<n>We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding [52.35520385083425]
FlowLearn dataset is a resource tailored to enhance the understanding of flowcharts. The scientific subset contains 3,858 flowcharts sourced from scientific literature. The simulated subset contains 10,000 flowcharts created using a customizable script.
arXiv Detail & Related papers (2024-07-06T20:58:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.