Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
- URL: http://arxiv.org/abs/2503.18172v5
- Date: Sat, 20 Sep 2025 08:48:39 GMT
- Title: Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
- Authors: Zixin Chen, Sicheng Song, Kashun Shum, Yanna Lin, Rui Sheng, Weiqi Wang, Huamin Qu
- Abstract summary: Misleading visualizations pose risks to public understanding and raise safety concerns for AI systems involved in data-driven communication. We benchmark 24 state-of-the-art MLLMs, analyze their performance across misleader types and chart formats, and propose a novel region-aware reasoning pipeline. Our work lays the foundation for developing MLLMs that are robust, trustworthy, and aligned with the demands of responsible visual communication.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Misleading visualizations, which manipulate chart representations to support specific claims, can distort perception and lead to incorrect conclusions. Despite decades of research, they remain a widespread issue, posing risks to public understanding and raising safety concerns for AI systems involved in data-driven communication. While recent multimodal large language models (MLLMs) show strong chart comprehension abilities, their capacity to detect and interpret misleading charts remains unexplored. We introduce the Misleading ChartQA benchmark, a large-scale multimodal dataset designed to evaluate MLLMs on misleading chart reasoning. It contains 3,026 curated examples spanning 21 misleader types and 10 chart types, each with standardized chart code, CSV data, multiple-choice questions, and labeled explanations, validated through iterative MLLM checks and expert human review. We benchmark 24 state-of-the-art MLLMs, analyze their performance across misleader types and chart formats, and propose a novel region-aware reasoning pipeline that enhances model accuracy. Our work lays the foundation for developing MLLMs that are robust, trustworthy, and aligned with the demands of responsible visual communication.
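The abstract specifies what each example contains (standardized chart code, CSV data, a multiple-choice question, a labeled explanation) but not a file layout or schema. The minimal sketch below shows one plausible way to load and score such an example; the directory layout, file names, and JSON keys are assumptions for illustration only.

```python
import csv
import json
from pathlib import Path

def load_example(example_dir: Path) -> dict:
    """Load one benchmark example from a directory (assumed layout)."""
    with open(example_dir / "question.json", encoding="utf-8") as f:
        # Assumed keys: question, options, answer, misleader_type, explanation.
        record = json.load(f)
    with open(example_dir / "data.csv", encoding="utf-8") as f:
        record["rows"] = list(csv.DictReader(f))  # the chart's underlying table
    record["chart_code"] = (example_dir / "chart.py").read_text(encoding="utf-8")
    return record

def score_choice(predicted: str, gold: str) -> bool:
    """Multiple-choice scoring: compare leading option letters, case-insensitively."""
    return predicted.strip().upper()[:1] == gold.strip().upper()[:1]
```

An evaluation harness along these lines would iterate over all 3,026 examples and group accuracy by misleader type and chart type, mirroring the per-category analysis the abstract describes.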
Related papers
- ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation [51.49421299447412]
Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables. We introduce ChartAttack, a framework for evaluating how MLLMs can be misused to generate misleading charts at scale.
arXiv Detail & Related papers (2026-01-19T11:57:48Z) - ChartM$^3$: Benchmarking Chart Editing with Multimodal Instructions [65.21061221740388]
We introduce a novel paradigm for multimodal chart editing, where user intent is expressed through a combination of natural language and visual indicators. We present ChartM$^3$, a new benchmark for Multimodal chart editing with Multi-level complexity and Multi-perspective evaluation.
arXiv Detail & Related papers (2025-07-25T13:30:14Z) - Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts [62.45232157149698]
We introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content.
Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of MLLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost.
arXiv Detail & Related papers (2025-03-06T05:08:40Z) - Protecting multimodal large language models against misleading visualizations [94.71976205962527]
We find that, on misleading visualizations, MLLM question-answering accuracy drops on average to the level of a random baseline.
We introduce the first inference-time methods to improve MLLM performance on these charts.
arXiv Detail & Related papers (2025-02-27T20:22:34Z) - ChartCitor: Multi-Agent Framework for Fine-Grained Chart Visual Attribution [47.79080056618323]
We present ChartCitor, a multi-agent framework that provides fine-grained bounding box citations by identifying supporting evidence within chart images.
The system orchestrates LLM agents to perform chart-to-table extraction, answer reformulation, table augmentation, evidence retrieval through pre-filtering and re-ranking, and table-to-chart mapping.
arXiv Detail & Related papers (2025-02-03T02:00:51Z) - Distill Visual Chart Reasoning Ability from LLMs to MLLMs [64.32993770646165]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient, and scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs (a generic sketch of the code-as-intermediary idea follows this list). ReachQA is a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both the recognition and reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z) - MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems [18.188725200923333]
Existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. We introduce MultiChartQA, a benchmark that evaluates MLLMs' capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field.
arXiv Detail & Related papers (2024-10-18T05:15:50Z) - CHARTOM: A Visual Theory-of-Mind Benchmark for LLMs on Misleading Charts [26.477627174115806]
We introduce CHARTOM, a visual theory-of-mind benchmark designed to evaluate multimodal large language models' capability to understand and reason about misleading data visualizations through charts. CHARTOM consists of carefully designed charts and associated questions that require a language model to not only correctly comprehend the factual content in the chart (the FACT question) but also judge whether the chart will be misleading to human readers (the MIND question). We detail the construction of our benchmark, including its calibration on human performance and the estimation of MIND ground truth, called the Human Misleadingness Index.
arXiv Detail & Related papers (2024-08-26T17:04:23Z) - Revisiting Multi-Modal LLM Evaluation [29.094387692681337]
We pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones.
Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs.
arXiv Detail & Related papers (2024-08-09T20:55:46Z) - How Good (Or Bad) Are LLMs at Detecting Misleading Visualizations? [35.79617496973775]
Misleading charts can distort the viewer's perception of data, leading to misinterpretations and decisions based on false information.
The development of effective automatic detection methods for misleading charts is an urgent field of research.
The recent advancement of multimodal Large Language Models has introduced a promising direction for addressing this challenge.
arXiv Detail & Related papers (2024-07-24T14:02:20Z) - On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z) - CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary and open-source models.
arXiv Detail & Related papers (2024-06-26T17:50:11Z) - ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning [55.22996841790139]
We benchmark the ability of off-the-shelf Multi-modal Large Language Models (MLLMs) in the chart domain. We construct ChartX, a multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 disciplinary topics, and high-quality chart data. We develop ChartVLM to offer a new perspective on handling multi-modal tasks that strongly depend on interpretable patterns.
arXiv Detail & Related papers (2024-02-19T14:48:23Z) - ChartBench: A Benchmark for Complex Visual Reasoning in Charts [36.492851648081405]
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation.
Current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics.
We propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning.
arXiv Detail & Related papers (2023-12-26T07:20:55Z)
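As flagged above, here is a generic sketch of the code-as-intermediary idea behind CIT: a text-only LLM writes chart-plotting code, the code is executed to render the chart, and a Q&A pair is synthesized from the code alone. This illustrates the general technique, not the authors' exact recipe; `generate_with_llm` is a hypothetical stand-in for any text-generation API.

```python
import subprocess
import tempfile
from pathlib import Path

def generate_with_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real text-generation API."""
    raise NotImplementedError

def synthesize_chart_example(topic: str) -> dict:
    # Step 1: a text-only LLM emits self-contained plotting code.
    chart_code = generate_with_llm(
        f"Write a self-contained matplotlib script about {topic} that "
        "saves its figure to chart.png."
    )
    # Step 2: execute the code to render the chart image.
    workdir = Path(tempfile.mkdtemp())
    (workdir / "chart.py").write_text(chart_code, encoding="utf-8")
    subprocess.run(["python", "chart.py"], cwd=workdir, check=True)
    # Step 3: synthesize a Q&A pair from the code, which fully specifies the chart.
    qa = generate_with_llm(
        "Given this plotting code, write one reasoning-intensive question "
        f"and its answer about the rendered chart:\n{chart_code}"
    )
    return {"code": chart_code, "image": workdir / "chart.png", "qa": qa}
```

The appeal of the intermediary is that the code is cheap to generate at scale with a text-only LLM, yet deterministically produces both the image and a faithful textual specification from which Q&A pairs can be derived.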