ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering
- URL: http://arxiv.org/abs/2505.23242v1
- Date: Thu, 29 May 2025 08:46:03 GMT
- Title: ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering
- Authors: Jingxuan Wei, Nan Xu, Junnan Zhu, Yanni Hao, Gaowei Wu, Bihui Yu, Lei Wang
- Abstract summary: Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. We introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. We propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements.
- Score: 14.468507852394923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.
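The abstract describes ChartLLM only at the level of its goals (extract key context, reduce noise, improve reasoning). As a concrete illustration, here is a minimal Python sketch of one way such a context-aware yet model-agnostic pipeline could be wired up; the two-stage extract-then-answer structure, the function names, and the prompts are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a context-aware, model-agnostic CQA pipeline in the
# spirit of ChartLLM. Every name and prompt here is an assumption, not the
# authors' implementation.
from dataclasses import dataclass
from typing import Callable

# An MLLM is modeled as a callable taking (image_path, prompt) -> text.
MLLM = Callable[[str, str], str]

@dataclass
class ChartContext:
    title: str
    axes: str
    key_values: str

def extract_context(model: MLLM, chart_path: str) -> ChartContext:
    """Stage 1: query the model for the chart elements most relevant to
    reasoning, discarding visual noise."""
    title = model(chart_path, "State the chart's title in one line.")
    axes = model(chart_path, "List the axis labels and units.")
    key_values = model(chart_path,
                       "List the most salient data points as 'label: value' pairs.")
    return ChartContext(title, axes, key_values)

def answer(model: MLLM, chart_path: str, question: str) -> str:
    """Stage 2: answer the question conditioned on the distilled context."""
    ctx = extract_context(model, chart_path)
    prompt = (
        f"Chart title: {ctx.title}\n"
        f"Axes: {ctx.axes}\n"
        f"Key values: {ctx.key_values}\n"
        f"Question: {question}\n"
        "Answer using only the context above; reply in free-form text."
    )
    return model(chart_path, prompt)
```

Because the model is passed in as a plain callable, the same wrapper works with any MLLM backend, which is what "model-agnostic" suggests here.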
Related papers
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling.
It comprises four distinct small-scale agents, with clearly defined roles and effective collaboration.
It shows superior performance with a smaller parameter scale, without sacrificing performance on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- Chart Question Answering from Real-World Analytical Narratives [5.051297047598238]
We present a new dataset for chart question answering (CQA) constructed from visualization notebooks.
The dataset features real-world, multi-view charts paired with natural language questions grounded in analytical narratives.
arXiv Detail & Related papers (2025-07-02T11:58:04Z)
- Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding [14.75820681491341]
Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning.
We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations.
Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
arXiv Detail & Related papers (2025-04-14T00:07:39Z)
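The Socratic Chart summary above rests on the idea that a Scalable Vector Graphics rendering exposes chart primitives symbolically rather than as pixels. Below is a minimal sketch of that payoff using only Python's standard library and an invented two-bar SVG; the framework's actual representation and parsing logic are not specified here.

```python
# Minimal illustration: once a chart is vector graphics, its primitives can
# be read symbolically instead of perceived from pixels. The SVG structure
# below is an illustrative assumption, not Socratic Chart's actual format.
import xml.etree.ElementTree as ET

SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <rect class="bar" x="10" y="60" width="20" height="40"/>
  <rect class="bar" x="40" y="30" width="20" height="70"/>
</svg>"""

def bar_heights(svg_text: str) -> list[float]:
    """Recover bar values symbolically from rect heights."""
    ns = "{http://www.w3.org/2000/svg}"
    root = ET.fromstring(svg_text)
    return [float(r.get("height")) for r in root.iter(f"{ns}rect")
            if r.get("class") == "bar"]

print(bar_heights(SVG))  # [40.0, 70.0] -> the second bar is taller
```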
- Towards Understanding Graphical Perception in Large Multimodal Models [80.44471730672801]
We leverage the theory of graphical perception to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts.
We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three levels (chart, visual element, and pixel).
arXiv Detail & Related papers (2025-03-13T20:13:39Z)
- Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts [62.45232157149698]
We introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content.
Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of MLLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost.
arXiv Detail & Related papers (2025-03-06T05:08:40Z)
- Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework [17.838177710655287]
Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge.
Existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications.
We propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, cross-modal verification, and keypoint-based generation.
arXiv Detail & Related papers (2025-02-20T18:59:42Z)
- Graph-Based Multimodal Contrastive Learning for Chart Question Answering [11.828192162922436]
This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures.
The framework integrates both visual and textual graphs to capture structural and semantic characteristics.
A graph contrastive learning strategy aligns node representations across modalities, enabling their seamless incorporation into a transformer decoder as soft prompts.
arXiv Detail & Related papers (2025-01-08T06:27:07Z)
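The entry above names a graph contrastive learning strategy that aligns node representations across modalities. A common way to realize such alignment is a symmetric InfoNCE objective over paired node embeddings; the sketch below assumes that formulation, since the paper's exact loss, dimensions, and temperature are not given here.

```python
# Sketch of a cross-modal graph contrastive objective: node embeddings from
# a visual graph and a textual graph are aligned so that matched nodes score
# higher than mismatched ones. Temperature and shapes are illustrative.
import torch
import torch.nn.functional as F

def graph_contrastive_loss(visual_nodes: torch.Tensor,
                           textual_nodes: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """visual_nodes, textual_nodes: (num_nodes, dim); row i of each tensor
    describes the same chart component in its modality."""
    v = F.normalize(visual_nodes, dim=-1)
    t = F.normalize(textual_nodes, dim=-1)
    logits = v @ t.T / temperature                      # cosine similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # node i matches node i
    # Symmetric InfoNCE: align visual->textual and textual->visual.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Aligned node embeddings could then be projected and prepended to the decoder input as the soft prompts the summary mentions.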
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings.
We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
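VisDoMRAG is summarized above as running visual and textual RAG simultaneously. Here is a minimal sketch of that control flow under stated assumptions: the retrievers and generator are abstract callables, and the evidence-fusion prompt is invented, not the paper's.

```python
# Hypothetical sketch of parallel visual + textual RAG: retrieve evidence
# separately over extracted text and over page images, then let one model
# answer from the merged evidence. All names here are assumptions.
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]   # (query, k) -> evidence snippets
Generator = Callable[[str], str]              # prompt -> answer

def parallel_rag_answer(question: str,
                        retrieve_text: Retriever,
                        retrieve_visual: Retriever,
                        generate: Generator,
                        k: int = 3) -> str:
    text_evidence = retrieve_text(question, k)       # OCR/text chunks
    visual_evidence = retrieve_visual(question, k)   # e.g., descriptions of retrieved page images
    prompt = (
        "Textual evidence:\n" + "\n".join(text_evidence) + "\n\n"
        "Visual evidence:\n" + "\n".join(visual_evidence) + "\n\n"
        f"Question: {question}\n"
        "Answer consistently with both evidence sets."
    )
    return generate(prompt)
```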
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- mChartQA: a universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning [8.1113308714581]
This paper introduces a novel multimodal chart question-answering model.
Our model integrates visual and linguistic processing, overcoming the constraints of existing methods.
This approach has demonstrated superior performance on multiple public datasets.
arXiv Detail & Related papers (2024-04-02T01:28:44Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Classification-Regression for Chart Comprehension [16.311371103939205]
Chart question answering (CQA) is a task for assessing chart comprehension.
We propose a new model that jointly learns classification and regression.
Our model's advantage is most pronounced on questions with out-of-vocabulary answers.
arXiv Detail & Related papers (2021-11-29T18:46:06Z)
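The classification-regression entry above pairs a fixed-vocabulary answer head with a regression head so that numeric, out-of-vocabulary answers can be predicted directly. Below is a minimal sketch of such a joint head and loss, with illustrative sizes and weighting rather than the paper's configuration.

```python
# Sketch of joint classification-regression for CQA: shared features feed a
# classifier over a fixed answer vocabulary and a regressor for numeric
# (out-of-vocabulary) answers. Sizes and loss weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClsRegHead(nn.Module):
    def __init__(self, hidden_dim: int = 512, vocab_size: int = 1000):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, vocab_size)  # fixed-vocab answers
        self.regressor = nn.Linear(hidden_dim, 1)            # numeric answers

    def forward(self, features: torch.Tensor):
        return self.classifier(features), self.regressor(features).squeeze(-1)

def joint_loss(cls_logits, reg_pred, cls_target, reg_target, is_numeric,
               alpha: float = 1.0):
    """Cross-entropy on vocabulary answers, MSE on numeric ones; `is_numeric`
    is a boolean mask selecting the regression examples in the batch."""
    loss = cls_logits.new_zeros(())
    if (~is_numeric).any():
        loss = loss + F.cross_entropy(cls_logits[~is_numeric],
                                      cls_target[~is_numeric])
    if is_numeric.any():
        loss = loss + alpha * F.mse_loss(reg_pred[is_numeric],
                                         reg_target[is_numeric])
    return loss
```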