Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
- URL: http://arxiv.org/abs/2510.21740v1
- Date: Thu, 02 Oct 2025 18:29:07 GMT
- Title: Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
- Authors: Alexa R. Tartaglini, Satchel Grant, Daniel Wurgaft, Christopher Potts, Judith E. Fan
- Abstract summary: Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty.
- Score: 25.564425023762045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data visualizations are vital components of many scientific articles and news stories. Current vision-language models (VLMs) still struggle on basic data visualization understanding tasks, but the causes of failure remain unclear. Are VLM failures attributable to limitations in how visual information in the data visualization is encoded, how information is transferred between the vision and language modules, or how information is processed within the language module? We developed FUGU, a suite of data visualization understanding tasks, to precisely characterize potential sources of difficulty (e.g., extracting the position of data points, distances between them, and other summary statistics). We used FUGU to investigate three widely used VLMs. To diagnose the sources of errors produced by these models, we used activation patching and linear probes to trace information flow through models across a variety of prompting strategies. We found that some models fail to generate the coordinates of individual data points correctly, and these initial errors often lead to erroneous final responses. When these models are provided with the correct coordinates, performance improves substantially. Moreover, even when the model generates an incorrect response, the correct coordinates can be successfully read out from the latent representations in the vision encoder, suggesting that the source of these errors lies in the vision-language handoff. We further found that while providing correct coordinates helps with tasks involving one or a small number of data points, it generally worsens performance for tasks that require extracting statistical relationships across many data points. Fine-tuning models on FUGU also fails to yield ceiling performance. These findings point to architectural constraints in current VLMs that might pose significant challenges for reliable data visualization understanding.
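To make the probing analysis above concrete, here is a minimal sketch of a linear probe that reads out a data point's coordinates from vision-encoder latents. This is not the authors' released code: the latent dimensionality, the synthetic latents, and all variable names are assumptions for illustration, and a real experiment would cache activations from an actual VLM's vision encoder over rendered charts.

```python
# Minimal sketch (assumed setup, not the paper's code): train a linear
# probe to read out a data point's (x, y) coordinates from vision-encoder
# latents. The latents below are synthetic stand-ins for pooled hidden
# states cached from a real VLM's vision encoder.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_charts, d_latent = 2000, 768                       # assumed sizes
true_coords = rng.uniform(0, 1, size=(n_charts, 2))  # ground-truth (x, y)

# Pretend the encoder embeds the coordinates linearly plus noise, so the
# probe has signal to recover; this stands in for real encoder activations.
W = rng.normal(size=(2, d_latent))
latents = true_coords @ W + 0.1 * rng.normal(size=(n_charts, d_latent))

X_tr, X_te, y_tr, y_te = train_test_split(
    latents, true_coords, test_size=0.2, random_state=0
)

probe = Ridge(alpha=1.0).fit(X_tr, y_tr)  # the linear probe itself
print("held-out R^2 of coordinate readout:", probe.score(X_te, y_te))
```

In the paper's analysis, the key observation is that a probe of this kind can recover the correct coordinates even when the model's final answer is wrong, which is what localizes the failure to the vision-language handoff rather than to the vision encoder itself.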
Related papers
- Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation [64.23194519770897]
We build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions. We propose an influence-function-based data selection and difficulty-based filtering strategy to identify high-quality training samples from this dataset. We train the VLM, referred to as Vision-G1, using multi-round RL with a data curriculum to iteratively improve its visual reasoning capabilities.
arXiv Detail & Related papers (2025-08-18T07:24:33Z) - Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation [25.283739839182147]
We show that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks. We propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information into CoT data. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats.
arXiv Detail & Related papers (2025-07-03T17:59:29Z) - Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method [10.748210940033484]
Large language models (LLMs) and vision-language models (VLMs) have achieved significant success. Due to the substantial differences between remote sensing images and conventional optical images, these models face challenges in comprehension. This letter explores the application of VLMs for object detection in remote sensing images.
arXiv Detail & Related papers (2025-03-11T08:02:54Z) - Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models [31.69213233651326]
We introduce the novel task of Visual Data-Type Identification.
An extensive zero-shot evaluation of 39 vision-language models (VLMs) shows a nuanced performance landscape.
arXiv Detail & Related papers (2023-10-12T17:59:30Z) - VIGC: Visual Instruction Generation and Correction [47.477290387002284]
The scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge.
The current leading paradigm, exemplified by LLaVA, relies on language-only GPT-4 to generate data.
This paper proposes the Visual Instruction Generation and Correction framework that enables multimodal large language models to generate instruction-tuning data.
arXiv Detail & Related papers (2023-08-24T11:21:05Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data [44.64190826937705]
We present scalable algorithms for detecting label errors and outlier data based on the relational graph structure of data.
We also introduce a visualization tool that provides contextual information of a data point in the feature-embedded space.
Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in large-scale real-world datasets.
arXiv Detail & Related papers (2023-01-29T02:09:13Z) - Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)