RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
- URL: http://arxiv.org/abs/2503.23131v1
- Date: Sat, 29 Mar 2025 15:50:08 GMT
- Title: RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning
- Authors: Alexander Vogel, Omar Moured, Yufan Chen, Jiaming Zhang, Rainer Stiefelhagen
- Abstract summary: RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
- Score: 63.599057862999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Vision Language Models (VLMs) have increasingly emphasized document visual grounding to achieve better human-computer interaction, accessibility, and detailed understanding. However, its application to visualizations such as charts remains under-explored due to the inherent complexity of interleaved visual-numerical relationships in chart images. Existing chart understanding methods primarily focus on answering questions without explicitly identifying the visual elements that support their predictions. To bridge this gap, we introduce RefChartQA, a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding, enabling models to refer to elements at multiple granularities within chart images. Furthermore, we conduct a comprehensive evaluation by instruction-tuning 5 state-of-the-art VLMs across different categories. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%, reducing hallucinations and improving model reliability. Additionally, we identify key factors influencing text-spatial alignment, such as architectural improvements in TinyChart, which leverages a token-merging module for enhanced feature fusion. Our dataset is open-sourced for community development and further advancements. All models and code will be publicly available at https://github.com/moured/RefChartQA.
Related papers
- Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding [14.75820681491341]
Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning.
We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations.
Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
arXiv Detail & Related papers (2025-04-14T00:07:39Z)
- Multimodal Graph Constrastive Learning and Prompt for ChartQA [11.828192162922436]
ChartQA presents significant challenges due to the complex distribution of chart elements and the implicit patterns embedded within the underlying data. We have developed a joint multimodal scene graph for charts, explicitly representing the relationships between chart elements and their associated patterns.
arXiv Detail & Related papers (2025-01-08T06:27:07Z)
- ChartAdapter: Large Vision-Language Model for Chart Summarization [13.499376163294816]
ChartAdapter is a lightweight transformer module designed to bridge the gap between charts and textual summaries. By integrating ChartAdapter with an LLM, we enable end-to-end training and efficient chart summarization.
arXiv Detail & Related papers (2024-12-30T05:07:34Z)
- VProChart: Answering Chart Question through Visual Perception Alignment Agent and Programmatic Solution Reasoning [13.011899331656018]
VProChart is a novel framework designed to address the challenges of Chart Question Answering (CQA).
It integrates a lightweight Visual Perception Alignment Agent (VPAgent) and a Programmatic Solution Reasoning approach.
VProChart significantly outperforms existing methods, highlighting its capability in understanding and reasoning with charts.
arXiv Detail & Related papers (2024-09-03T07:19:49Z)
- MSG-Chart: Multimodal Scene Graph for ChartQA [11.828192162922436]
Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements with patterns of the underlying data not explicitly displayed in charts.
We design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns.
Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantical knowledge from the chart.
arXiv Detail & Related papers (2024-08-09T04:11:23Z)
- Advancing Chart Question Answering with Robust Chart Component Recognition [18.207819321127182]
We introduce a unified framework that enhances chart component recognition by accurately identifying and classifying components such as bars, lines, pies, titles, legends, and axes.
We also propose a novel Question-guided Deformable Co-Attention mechanism, which fuses chart features encoded by Chartformer with the given question, leveraging the question's guidance to ground the correct answer.
arXiv Detail & Related papers (2024-07-19T20:55:06Z)
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning [83.58521787193293]
We present TinyChart, an efficient MLLM for chart understanding with only 3B parameters.
TinyChart overcomes two key challenges in efficient chart understanding: (1) reducing the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, and (2) reducing the lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module.
arXiv Detail & Related papers (2024-04-25T14:23:24Z)
- StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding [54.45681512355684]
Current chart-related tasks focus on either chart perception that extracts information from the visual charts, or chart reasoning given the extracted data. We introduce StructChart, a novel framework that leverages Structured Triplet Representations (STR) to achieve a unified and label-efficient approach.
arXiv Detail & Related papers (2023-09-20T12:51:13Z)
- ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules [89.75395046894809]
We present ChartReader, a unified framework that seamlessly integrates chart derendering and comprehension tasks.
Our approach includes a transformer-based chart component detection module and an extended pre-trained vision-language model for chart-to-X tasks.
Our proposed framework can significantly reduce the manual effort involved in chart analysis, providing a step towards a universal chart understanding model.
arXiv Detail & Related papers (2023-04-05T00:25:27Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.