What Lies Beneath: A Call for Distribution-based Visual Question & Answer Datasets
- URL: http://arxiv.org/abs/2601.22218v1
- Date: Thu, 29 Jan 2026 19:00:01 GMT
- Title: What Lies Beneath: A Call for Distribution-based Visual Question & Answer Datasets
- Authors: Jill P. Naiman, Daniel J. Evans, JooYoung Seo,
- Abstract summary: We argue for a dedicated VQA benchmark for scientific charts where there is no 1-to-1 correspondence between chart marks and underlying data.<n>We generate synthetic histogram charts based on ground truth data, and ask both humans and a large reasoning model questions where precise answers depend on access to the underlying data.<n>We release the open-source dataset, including figures, underlying data, distribution parameters used to generate the data, and bounding boxes for all figure marks and text for future research.
- Score: 3.1351527202068445
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis, with few focused on interpreting complex scientific charts. Indeed, many VQA datasets that analyze charts do not contain the underlying data behind those charts or assume a 1-to-1 correspondence between chart marks and underlying data. In reality, charts are transformations (i.e. analysis, simplification, modification) of data. This distinction introduces a reasoning challenge in VQA that the current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts where there is no 1-to-1 correspondence between chart marks and underlying data. To do so, we survey existing VQA datasets and highlight limitations of the current field. We then generate synthetic histogram charts based on ground truth data, and ask both humans and a large reasoning model questions where precise answers depend on access to the underlying data. We release the open-source dataset, including figures, underlying data, distribution parameters used to generate the data, and bounding boxes for all figure marks and text for future research.
Related papers
- In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding [113.17601814293722]
We introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types.<n>We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types.<n>We also establish ChartDQA, a new benchmark for evaluating not only question-answering at different levels but also underlying data understanding.
arXiv Detail & Related papers (2025-07-18T18:15:09Z) - ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering [27.58410749367183]
We introduce ChartQAPro, a new benchmark that includes 1,341 charts from 157 diverse sources, spanning various chart types.<n>Our evaluations with 21 models show a substantial performance drop for LVLMs on ChartQAPro.
arXiv Detail & Related papers (2025-04-07T21:05:06Z) - RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding.<n>Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z) - Chart-HQA: A Benchmark for Hypothetical Question Answering in Charts [62.45232157149698]
We introduce a novel Chart Hypothetical Question Answering (HQA) task, which imposes assumptions on the same question to compel models to engage in counterfactual reasoning based on the chart content.<n> Furthermore, we introduce HAI, a human-AI interactive data synthesis approach that leverages the efficient text-editing capabilities of MLLMs alongside human expert knowledge to generate diverse and high-quality HQA data at a low cost.
arXiv Detail & Related papers (2025-03-06T05:08:40Z) - ChartAssisstant: A Universal Chart Multimodal Language Model via
Chart-to-Table Pre-training and Multitask Instruction Tuning [54.89249749894061]
ChartAssistant is a vision-language model for universal chart comprehension and reasoning.
It undergoes a two-stage training process, starting with pre-training on chart-to-table parsing to align chart and text.
Experimental results demonstrate significant performance gains over the state-of-the-art UniChart and Chartllama method.
arXiv Detail & Related papers (2024-01-04T17:51:48Z) - RealCQA: Scientific Chart Question Answering as a Test-bed for
First-Order Logic [8.155575318208628]
We introduce a benchmark and dataset for chart visual QA on real-world charts.
Our contribution includes the introduction of a new answer type, 'list', with both ranked and unranked variations.
Results of our experiments, conducted on a real-world out-of-distribution dataset, provide a robust evaluation of large-scale pre-trained models.
arXiv Detail & Related papers (2023-08-03T18:21:38Z) - OpenCQA: Open-ended Question Answering with Charts [6.7038829115674945]
We introduce a new task called OpenCQA, where the goal is to answer an open-ended question about a chart with texts.
We implement and evaluate a set of baselines under three practical settings.
Our analysis of the results show that the top performing models generally produce fluent and coherent text.
arXiv Detail & Related papers (2022-10-12T23:37:30Z) - ChartQA: A Benchmark for Question Answering about Charts with Visual and
Logical Reasoning [7.192233658525916]
We present a benchmark covering 9.6K human-written questions and 23.1K questions generated from human-written chart summaries.
We present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions.
arXiv Detail & Related papers (2022-03-19T05:00:30Z) - Question-Answer Sentence Graph for Joint Modeling Answer Selection [122.29142965960138]
We train and integrate state-of-the-art (SOTA) models for computing scores between question-question, question-answer, and answer-answer pairs.
Online inference is then performed to solve the AS2 task on unseen queries.
arXiv Detail & Related papers (2022-02-16T05:59:53Z) - Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in
Visual Question Answering [71.6781118080461]
We propose a Graph Matching Attention (GMA) network for Visual Question Answering (VQA) task.
firstly, it builds graph for the image, but also constructs graph for the question in terms of both syntactic and embedding information.
Next, we explore the intra-modality relationships by a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question.
Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset.
arXiv Detail & Related papers (2021-12-14T10:01:26Z) - Classification-Regression for Chart Comprehension [16.311371103939205]
Chart question answering (CQA) is a task used for assessing chart comprehension.
We propose a new model that jointly learns classification and regression.
Our model's edge is particularly emphasized on questions with out-of-vocabulary answers.
arXiv Detail & Related papers (2021-11-29T18:46:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.