Related papers: CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

URL: http://arxiv.org/abs/2406.18521v1
Date: Wed, 26 Jun 2024 17:50:11 GMT
Title: CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Authors: Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
Abstract summary: CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (GPT-4o, 47.1% accuracy) and the strongest open-source model (InternVL Chat V1.5, 29.2% accuracy).
Score: 62.84082370758761
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/
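The gaps described in the abstract can be made explicit with a little arithmetic. The following is a minimal sketch in Python, not code from the paper: the `ACCURACY` table and the `gap` helper are hypothetical names introduced here, and the figures are copied from the abstract above.

```python
# Accuracy figures (percent) exactly as quoted in the abstract above.
# The dictionary name and the helper below are illustrative assumptions,
# not part of the CharXiv codebase.
ACCURACY = {
    "human": 80.5,
    "GPT-4o": 47.1,
    "InternVL Chat V1.5": 29.2,
}


def gap(a: str, b: str) -> float:
    """Return the accuracy of `a` minus the accuracy of `b`,
    in percentage points, rounded to one decimal place."""
    return round(ACCURACY[a] - ACCURACY[b], 1)


if __name__ == "__main__":
    # Gap between the strongest proprietary and strongest open-source model.
    print("proprietary vs. open-source:", gap("GPT-4o", "InternVL Chat V1.5"))
    # Gap between human performance and the strongest model.
    print("human vs. GPT-4o:", gap("human", "GPT-4o"))
```

By these figures, the strongest proprietary model leads the strongest open-source model by 17.9 points, while humans lead the strongest model by 33.4 points.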

Related papers

CHAOS: Chart Analysis with Outlier Samples [31.64244745491319]
CHAOS is a benchmark to evaluate Multimodal Large Language Models (MLLMs) against chart perturbations. The benchmark includes 13 state-of-the-art MLLMs divided into three groups according to the training scope and data.
arXiv Detail & Related papers (2025-05-22T19:26:49Z)
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
Lens is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios. The dataset supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z)
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models [48.994853869901974]
We conduct a case study using a synthetic dataset solvable only through visual reasoning. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions. Although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%.
arXiv Detail & Related papers (2025-05-19T17:59:27Z)
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering [27.58410749367183]
We introduce ChartQAPro, a new benchmark that includes 1,341 charts from 157 diverse sources, spanning various chart types. Our evaluations with 21 models show a substantial performance drop for LVLMs on ChartQAPro.
arXiv Detail & Related papers (2025-04-07T21:05:06Z)
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering [45.67334913593117]
Misleading visualizations pose risks to public understanding and raise safety concerns for AI systems involved in data-driven communication. We benchmark 24 state-of-the-art MLLMs, analyze their performance across misleader types and chart formats, and propose a novel region-aware reasoning pipeline. Our work lays the foundation for developing MLLMs that are robust, trustworthy, and aligned with the demands of responsible visual communication.
arXiv Detail & Related papers (2025-03-23T18:56:33Z)
Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts [62.45232157149698]
We introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content. Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of MLLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost.
arXiv Detail & Related papers (2025-03-06T05:08:40Z)
Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
How Do Large Language Models Understand Graph Patterns? A Benchmark for Graph Pattern Comprehension [53.6373473053431]
This work introduces a benchmark to assess large language models' capabilities in graph pattern tasks. We have developed a benchmark that evaluates whether LLMs can understand graph patterns based on either terminological or topological descriptions. Our benchmark encompasses both synthetic and real datasets, and a variety of models, with a total of 11 tasks and 7 models.
arXiv Detail & Related papers (2024-10-04T04:48:33Z)
Can Large Language Models Analyze Graphs like Professionals? A Benchmark, Datasets and Models [90.98855064914379]
We introduce ProGraph, a benchmark for large language models (LLMs) to process graphs. Our findings reveal that the performance of current LLMs is unsatisfactory, with the best model achieving only 36% accuracy. We propose LLM4Graph datasets, which include crawled documents and auto-generated codes based on 6 widely used graph libraries.
arXiv Detail & Related papers (2024-09-29T11:38:45Z)
CHARTOM: A Visual Theory-of-Mind Benchmark for LLMs on Misleading Charts [26.477627174115806]
We introduce CHARTOM, a visual theory-of-mind benchmark designed to evaluate multimodal large language models' capability to understand and reason about misleading data visualizations through charts. CHARTOM consists of carefully designed charts and associated questions that require a language model to not only correctly comprehend the factual content in the chart (the FACT question) but also judge whether the chart will be misleading to a human reader (the MIND question). We detail the construction of our benchmark, including its calibration on human performance and the estimation of MIND ground truth, called the Human Misleadingness Index.
arXiv Detail & Related papers (2024-08-26T17:04:23Z)
On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts. We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
Unraveling the Truth: Do VLMs really Understand Charts? A Deep Dive into Consistency and Robustness [47.68358935792437]
Chart question answering (CQA) is a crucial area of Visual Language Understanding. Current Visual Language Models (VLMs) in this field remain under-explored. This paper evaluates state-of-the-art VLMs on comprehensive datasets.
arXiv Detail & Related papers (2024-07-15T20:29:24Z)
Are Large Vision Language Models up to the Challenge of Chart Comprehension and Reasoning? An Extensive Investigation into the Capabilities and Limitations of LVLMs [11.19928977117624]
Natural language is a powerful complementary modality of communication for data visualizations, such as bar and line charts. Various downstream tasks have been introduced recently such as chart question answering, chart summarization, and fact-checking with charts. These tasks pose a unique challenge, demanding both vision-language reasoning and a nuanced understanding of chart data tables, visual encodings, and natural language prompts. This paper presents the first comprehensive evaluation of the recently developed large vision language models (LVLMs) for chart understanding and reasoning tasks.
arXiv Detail & Related papers (2024-06-01T01:43:30Z)
ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning [54.82612435284695]
We benchmark the ability of off-the-shelf Multi-modal Large Language Models (MLLMs) in the chart domain. We construct ChartX, a multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data. We develop ChartVLM to offer a new perspective on handling multi-modal tasks that strongly depend on interpretable patterns.
arXiv Detail & Related papers (2024-02-19T14:48:23Z)
ChartBench: A Benchmark for Complex Visual Reasoning in Charts [36.492851648081405]
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation. Current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics. We propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning.
arXiv Detail & Related papers (2023-12-26T07:20:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.