CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models
- URL: http://arxiv.org/abs/2505.17202v1
- Date: Thu, 22 May 2025 18:15:04 GMT
- Title: CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models
- Authors: Arnav Verma, Kushin Mukherjee, Christopher Potts, Elisa Kreiss, Judith E. Fan
- Abstract summary: It is unclear to what degree vision-language models emulate human behavior on tasks that involve reasoning about data visualizations. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans. We found that these models performed worse than human participants on average.
- Score: 18.891323067948285
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Data visualizations are powerful tools for communicating patterns in quantitative data. Yet understanding any data visualization is no small feat -- succeeding requires jointly making sense of visual, numerical, and linguistic inputs arranged in a conventionalized format one has previously learned to parse. Recently developed vision-language models are, in principle, promising candidates for developing computational models of these cognitive operations. However, it is currently unclear to what degree these models emulate human behavior on tasks that involve reasoning about data visualizations. This gap reflects limitations in prior work that has evaluated data visualization understanding in artificial systems using measures that differ from those typically used to assess these abilities in humans. Here we evaluated eight vision-language models on six data visualization literacy assessments designed for humans and compared model responses to those of human participants. We found that these models performed worse than human participants on average, and this performance gap persisted even when using relatively lenient criteria to assess model performance. Moreover, while relative performance across items was somewhat correlated between models and humans, all models produced patterns of errors that were reliably distinct from those produced by human participants. Taken together, these findings suggest significant opportunities for further development of artificial systems that might serve as useful models of how humans reason about data visualizations. All code and data needed to reproduce these results are available at: https://osf.io/e25mu/?view_only=399daff5a14d4b16b09473cf19043f18.
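The following is a minimal sketch, not the authors' code (their analysis code and data are in the OSF repository linked above), of the kind of item-level comparison the abstract describes: the average accuracy gap between humans and a model, plus a rank correlation of per-item accuracies. All array values and variable names here are hypothetical placeholders.

```python
# Minimal sketch of an item-level model-vs-human comparison.
# Values and names are hypothetical; see the OSF repository for the authors' code.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-item accuracies (proportion correct) on one assessment,
# one entry per test item.
human_acc = np.array([0.92, 0.71, 0.64, 0.88, 0.45, 0.79])
model_acc = np.array([0.80, 0.55, 0.60, 0.75, 0.30, 0.50])

# Average performance gap between human participants and the model.
gap = human_acc.mean() - model_acc.mean()

# Does item difficulty for the model track item difficulty for humans?
rho, p_value = spearmanr(human_acc, model_acc)

print(f"mean human accuracy: {human_acc.mean():.2f}")
print(f"mean model accuracy: {model_acc.mean():.2f}")
print(f"human-model gap:     {gap:.2f}")
print(f"item-level Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```

A rank correlation near 1 would indicate that items hard for humans are also hard for the model, even when overall accuracy differs.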
Related papers
- Testing the limits of fine-tuning to improve reasoning in vision language models [51.58859621164201]
We introduce visual stimuli and human judgments on visual cognition tasks to evaluate performance across cognitive domains. We fine-tune models on ground truth data for intuitive physics and causal reasoning. We find that fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.
arXiv Detail & Related papers (2025-02-21T18:58:30Z) - DevBench: A multimodal developmental benchmark for language learning [0.34129029452670606]
We introduce DevBench, a benchmark of tasks and behavioral data for evaluating vision-language models and comparing them to human language development. These comparisons highlight ways in which model and human language learning processes diverge.
arXiv Detail & Related papers (2024-06-14T17:49:41Z) - Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z) - Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z) - Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for ...
arXiv Detail & Related papers (2023-05-01T17:36:06Z) - Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Ambiguous Images With Human Judgments for Robust Visual Event Classification [34.62731821199598]
We create datasets of ambiguous images and use them to produce SQUID-E ("Squidy"), a collection of noisy images extracted from videos.
All images are annotated with ground truth values and a test set is annotated with human uncertainty judgments.
We use this dataset to characterize human uncertainty in vision tasks and evaluate existing visual event classification models.
arXiv Detail & Related papers (2022-10-06T17:52:20Z) - Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets [0.0]
We select publicly available state-of-the-art visual search models in natural scenes and evaluate them on different datasets.
We propose an improvement to the Ideal Bayesian Searcher through a combination with a neural network-based visual search model.
arXiv Detail & Related papers (2021-12-10T19:56:45Z) - Exploring Alignment of Representations with Human Perception [47.53970721813083]
We propose that inputs mapped to similar representations by a model should also be perceived as similar by humans.
Our approach yields a measure of the extent to which a model is aligned with human perception.
We find that various properties of a model like its architecture, training paradigm, training loss, and data augmentation play a significant role in learning representations that are aligned with human perception.
arXiv Detail & Related papers (2021-11-29T17:26:50Z)
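The alignment measure described in the entry above can be illustrated with a small sketch: compute pairwise distances between a model's representations of a set of stimuli and rank-correlate them with human dissimilarity judgments over the same pairs. This is an illustrative reconstruction under assumed inputs, not that paper's actual method or code; all names and data below are hypothetical.

```python
# Illustrative sketch of the representation-alignment idea: stimulus pairs that
# the model represents as distant should also look dissimilar to humans.
# All names and data are hypothetical placeholders.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical model embeddings for 8 stimuli (one row per stimulus).
embeddings = rng.normal(size=(8, 16))

# Pairwise distances in the model's representation space
# (condensed form: one value per stimulus pair).
model_distances = pdist(embeddings, metric="cosine")

# Hypothetical human dissimilarity ratings for the same stimulus pairs,
# in the same pair order as the condensed distance vector.
human_dissimilarity = rng.uniform(size=model_distances.shape[0])

# Alignment score: rank correlation between model distances and human judgments.
rho, _ = spearmanr(model_distances, human_dissimilarity)
print(f"representation-perception alignment (Spearman rho): {rho:.2f}")
```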
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.