ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning
- URL: http://arxiv.org/abs/2512.00305v1
- Date: Sat, 29 Nov 2025 04:01:55 GMT
- Title: ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning
- Authors: Zhengzhuo Xu, SiNan Du, Yiyan Qi, Siwen Lu, Chengjin Xu, Chun Yuan, Jian Guo
- Abstract summary: We propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning for charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions. We develop two instruction-tuned models, ChartPointQ2 and ChartPointQ2.5, which outperform state-of-the-art models across several chart benchmarks.
- Score: 54.86473583610112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have emerged as powerful tools for chart comprehension. However, they rely heavily on content extracted via OCR, which leads to numerical hallucinations when chart textual annotations are sparse. While existing methods focus on scaling instructions, they fail to address the fundamental challenge of grounding reasoning in visual perception. In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize the key positions that would match their reasoning. To bridge this gap, we propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning for charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions. We further introduce an automated pipeline to construct ChartPoint-SFT-62k, a dataset featuring 19.2K high-quality chart samples with step-by-step CoT, bounding boxes, and re-rendered visualizations. Leveraging this data, we develop two instruction-tuned models, ChartPointQ2 and ChartPointQ2.5, which outperform state-of-the-art models across several chart benchmarks, e.g., by +5.04% on ChartBench.
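The reflective loop the abstract describes can be pictured with a short sketch. Everything below is an illustrative assumption rather than ChartPoint's released code: the `ask_mllm` stub stands in for any multimodal model call, and the prompt wording, `parse_bbox` helper, and pixel-coordinate bounding-box format are invented for the example.

```python
# Minimal sketch of a PointCoT-style reason/ground/re-render/reflect loop.
# ask_mllm, the prompt wording, and the (x0, y0, x1, y1) pixel bbox format
# are assumptions for illustration, not ChartPoint's actual interface.
import re
from PIL import Image, ImageDraw

def ask_mllm(image: Image.Image, prompt: str) -> str:
    """Placeholder for any multimodal LLM call (API or local model)."""
    raise NotImplementedError

def parse_bbox(reply: str) -> tuple[int, int, int, int]:
    """Pull the last four integers out of the reply as a pixel bbox."""
    x0, y0, x1, y1 = (int(n) for n in re.findall(r"\d+", reply)[-4:])
    return x0, y0, x1, y1

def point_cot(chart_path: str, question: str) -> str:
    chart = Image.open(chart_path).convert("RGB")
    # 1) Ask for a reasoning step plus the region that supports it.
    reply = ask_mllm(
        chart,
        f"Question: {question}\n"
        "State one reasoning step and the bounding box (x0, y0, x1, y1) "
        "of the chart region that supports it.",
    )
    bbox = parse_bbox(reply)
    # 2) Re-render: draw the predicted box back onto the chart image.
    annotated = chart.copy()
    ImageDraw.Draw(annotated).rectangle(bbox, outline="red", width=3)
    # 3) Reflect: let the model verify its step against the grounded region.
    return ask_mllm(
        annotated,
        f"Your previous step was:\n{reply}\n"
        "Check whether the highlighted region actually supports it; "
        "revise if necessary, then answer the question.",
    )
```

The point of step 2 is that the predicted box is rendered back into pixels, so the reflection in step 3 checks the reasoning in the same visual modality it came from.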
Related papers
- START: Spatial and Textual Learning for Chart Understanding [11.769123092079203]
We propose START, the Spatial and Textual learning for chART understanding. We introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. Code, data, and models will be publicly available.
arXiv Detail & Related papers (2025-12-08T05:43:14Z)
- ChartAB: A Benchmark for Chart Grounding & Dense Alignment [17.16234793106]
We introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of vision-language models (VLMs). By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of the evaluations reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding.
arXiv Detail & Related papers (2025-10-30T17:56:31Z)
- ChartLens: Fine-grained Visual Attribution in Charts [106.44872805609673]
Post-Hoc Visual Attribution for Charts identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
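As a rough illustration of what segmentation-based attribution can look like (a toy sketch, not ChartLens's algorithm), one can segment solid-color bars by exact color match and convert their pixel heights back into data values; the function name, exact-color assumption, and axis-calibration inputs are all invented here:

```python
# Toy segmentation-style attribution sketch (not ChartLens's algorithm):
# segment solid-color bars by exact color match, then convert each bar's
# pixel height into a data value using simple y-axis calibration.
import numpy as np
from PIL import Image

def bar_values(path, bar_rgb, px_per_unit, baseline_row):
    """Return (x_center, value) for each detected bar.

    bar_rgb:      exact RGB fill color of the bars, e.g. (31, 119, 180)
    px_per_unit:  pixels per data unit, read off the y-axis ticks
    baseline_row: pixel row of the y = 0 line (rows grow downward)
    """
    img = np.asarray(Image.open(path).convert("RGB"))
    mask = np.all(img == bar_rgb, axis=-1)             # H x W boolean mask
    col_idx = np.flatnonzero(mask.any(axis=0))         # columns touching a bar
    breaks = np.flatnonzero(np.diff(col_idx) > 1) + 1  # gaps separate bars
    results = []
    for bar_cols in np.split(col_idx, breaks):
        top_row = np.flatnonzero(mask[:, bar_cols].any(axis=1)).min()
        value = (baseline_row - top_row) / px_per_unit
        results.append((int(bar_cols.mean()), value))
    return results
```

Attribution then amounts to returning the segment whose recovered value best matches the number cited in the model's response.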
arXiv Detail & Related papers (2025-05-25T23:17:32Z)
- Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding [14.75820681491341]
Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
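The appeal of an SVG intermediate is that primitives carry exact coordinates and values an agent can read off instead of estimating from pixels. Here is a toy chart-to-SVG converter to make that concrete; the schema and `data-*` attributes are illustrative assumptions, not Socratic Chart's representation:

```python
# Toy bar-chart-to-SVG conversion (schema is illustrative): once data lives
# in SVG primitives, an agent can read exact geometry (x, y, width, height)
# and attached values instead of estimating them from a raster image.

def bars_to_svg(labels, values, width=400, height=200, pad=20):
    vmax = max(values)
    slot_w = (width - 2 * pad) / len(values)
    rects = []
    for i, (label, v) in enumerate(zip(labels, values)):
        h = (height - 2 * pad) * v / vmax
        x = pad + i * slot_w
        y = height - pad - h  # SVG y grows downward
        rects.append(
            f'<rect data-label="{label}" data-value="{v}" '
            f'x="{x:.1f}" y="{y:.1f}" width="{slot_w * 0.8:.1f}" height="{h:.1f}"/>'
        )
    body = "\n  ".join(rects)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">\n'
        f"  {body}\n</svg>"
    )

print(bars_to_svg(["Q1", "Q2", "Q3"], [10, 25, 17]))
```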
arXiv Detail & Related papers (2025-04-14T00:07:39Z)
- RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [64.32993770646165]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient, and scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. ReachQA is a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both the recognition and reasoning abilities of MLLMs.
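The idea behind code as intermediary is that a text-only LLM emits plotting code; rendering that code produces the chart image, and the data embedded in the code doubles as exact Q&A ground truth. A minimal sketch of that loop, assuming matplotlib as the renderer and with hard-coded data standing in for LLM output:

```python
# Minimal code-as-intermediary synthesis loop (illustrative; the helper and
# the hard-coded data are assumptions, not the paper's pipeline). A text-only
# LLM would write the plotting code; rendering it yields the chart image, and
# the data inside the code yields exact Q&A ground truth.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def synthesize_example(seed_topic: str):
    data = {"2021": 14, "2022": 23, "2023": 19}  # in practice: LLM-generated
    # The "intermediary": plotting code that deterministically renders a chart.
    fig, ax = plt.subplots()
    ax.bar(list(data), list(data.values()))
    ax.set_title(f"{seed_topic} by year")
    fig.savefig("chart.png")
    plt.close(fig)
    # Ground truth comes from the data, not from reading the rendered image.
    question = "Which year had the highest value?"
    answer = max(data, key=data.get)
    return "chart.png", question, answer

print(synthesize_example("Widget sales"))
```

Because the supervision is derived from the code rather than from vision, no human ever has to verify the rendered pixels, which is what makes the recipe cheap to scale.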
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning [83.58521787193293]
We present TinyChart, an efficient MLLM for chart understanding with only 3B parameters.
TinyChart overcomes two key challenges in efficient chart understanding: (1) it reduces the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, and (2) it shortens the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module.
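In a Program-of-Thoughts setup, the model emits a short program for the numeric part of the answer and an executor runs it, rather than the model doing arithmetic in free text. A schematic sketch follows; the prompt/executor split and the restricted `exec` are assumptions for illustration, not TinyChart's code:

```python
# Schematic Program-of-Thoughts step (illustrative, not TinyChart's code):
# the MLLM reads values off the chart and emits a program; a Python executor
# does the arithmetic, avoiding in-text numerical errors.

# What the model might output for "How much did sales grow from Q1 to Q2?":
model_program = """
q1_sales = 412.0   # read from the Q1 bar
q2_sales = 537.5   # read from the Q2 bar
answer = q2_sales - q1_sales
"""

def execute_pot(program: str):
    scope: dict = {}
    exec(program, {"__builtins__": {}}, scope)  # crude restricted execution
    return scope["answer"]

print(execute_pot(model_program))  # 125.5
```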
arXiv Detail & Related papers (2024-04-25T14:23:24Z)
- ChartBench: A Benchmark for Complex Visual Reasoning in Charts [36.492851648081405]
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation.
Current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics.
We propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning.
arXiv Detail & Related papers (2023-12-26T07:20:55Z)
- ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules [89.75395046894809]
We present ChartReader, a unified framework that seamlessly integrates chart derendering and comprehension tasks.
Our approach includes a transformer-based chart component detection module and an extended pre-trained vision-language model for chart-to-X tasks.
Our proposed framework can significantly reduce the manual effort involved in chart analysis, providing a step towards a universal chart understanding model.
arXiv Detail & Related papers (2023-04-05T00:25:27Z)