Evaluating Graphical Perception with Multimodal LLMs
- URL: http://arxiv.org/abs/2504.04221v1
- Date: Sat, 05 Apr 2025 16:14:08 GMT
- Title: Evaluating Graphical Perception with Multimodal LLMs
- Authors: Rami Huu Nguyen, Kenichi Maeda, Mahsa Geshvadi, Daniel Haehn
- Abstract summary: Multimodal Large Language Models (MLLMs) have remarkably progressed in analyzing and understanding images. For visualization, how do MLLMs perform when applied to graphical perception tasks? Our study primarily evaluates fine-tuned and pretrained models and zero-shot prompting to determine if they closely match human graphical perception.
- Score: 2.090547583226381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have remarkably progressed in analyzing and understanding images. Despite these advancements, accurately regressing values in charts remains an underexplored area for MLLMs. For visualization, how do MLLMs perform when applied to graphical perception tasks? Our paper investigates this question by reproducing Cleveland and McGill's seminal 1984 experiment and comparing it against human task performance. Our study primarily evaluates fine-tuned and pretrained models and zero-shot prompting to determine if they closely match human graphical perception. Our findings highlight that MLLMs outperform human task performance in some cases but not in others. We highlight the results of all experiments to foster an understanding of where MLLMs succeed and fail when applied to data visualization.
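To make the evaluation setup concrete, the sketch below shows a minimal zero-shot scoring loop for one Cleveland-and-McGill-style proportion-judgment stimulus. The prompt wording, the `query_mllm` stub, the stimulus filename, and the ground-truth value are illustrative assumptions, not the paper's actual code; the log-absolute-error term follows the convention commonly used in the graphical-perception literature (log2 of the absolute error in percentage points plus 1/8).

```python
# Hypothetical sketch of scoring an MLLM on a proportion-judgment task.
# All names and values here are illustrative assumptions.
import math
import re
from typing import Optional


def parse_percentage(answer: str) -> Optional[float]:
    """Pull the first number out of a free-text model answer."""
    match = re.search(r"\d+(\.\d+)?", answer)
    return float(match.group()) if match else None


def log_absolute_error(judged: float, true: float) -> float:
    """Log-absolute-error term from the graphical-perception literature:
    log2(|judged - true| + 1/8), with values in percentage points."""
    return math.log2(abs(judged - true) + 0.125)


def query_mllm(image_path: str, prompt: str) -> str:
    """Placeholder for a real multimodal model call (an API or a
    fine-tuned checkpoint); returns a canned answer here."""
    return "The smaller bar is roughly 45% of the larger one."


if __name__ == "__main__":
    prompt = ("Two bars are marked in this chart. What percentage is the "
              "smaller value of the larger value? Answer with a number.")
    true_ratio = 50.0  # ground-truth proportion for this illustrative stimulus
    answer = query_mllm("stimulus_001.png", prompt)
    judged = parse_percentage(answer)
    if judged is not None:
        error = log_absolute_error(judged, true_ratio)
        print(f"judged={judged:.1f}%, error={error:.3f}")
```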
Related papers
- Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images [58.38037252899024]
We present a system using Multimodal LLMs to analyze a large database with tens of millions of images.
We aim to capture frequent co-occurring changes ("trends") across a city over a certain period.
We find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities.
arXiv Detail & Related papers (2025-04-11T17:55:45Z) - MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs [11.532430076027554]
We study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images.
We propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself.
arXiv Detail & Related papers (2025-02-24T18:54:40Z) - Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z) - Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities [31.293869275511412]
This paper thoroughly revisits the Multimodal Large Language Models (MLLMs) with an in-depth analysis of image classification.
Our findings reveal that the most recent MLLMs can match or even outperform CLIP-style vision-language models on several datasets.
arXiv Detail & Related papers (2024-12-21T00:46:56Z) - Do Multimodal Large Language Models See Like Humans? [50.938168841711445]
Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. Current benchmarks lack the ability to evaluate MLLMs from this perspective. We introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system.
arXiv Detail & Related papers (2024-12-12T18:59:25Z) - Face-MLLM: A Large Face Perception Model [53.9441375205716]
Multimodal large language models (MLLMs) have achieved promising results on a wide range of vision-language tasks, but their ability to perceive and understand human faces is rarely explored.
In this work, we comprehensively evaluate existing MLLMs on face perception tasks.
Our model surpasses previous MLLMs on five famous face perception tasks.
arXiv Detail & Related papers (2024-10-28T04:19:32Z) - Visualization Literacy of Multimodal Large Language Models: A Comparative Study [12.367399155606162]
Multimodal large language models (MLLMs) combine the inherent power of large language models (LLMs) with the renewed capabilities to reason about the multimodal context.
Many recent works in visualization have demonstrated MLLMs' capability to understand and interpret visualization results and explain the content of the visualization to users in natural language.
In this work, we aim to fill the gap by utilizing the concept of visualization literacy to evaluate MLLMs.
arXiv Detail & Related papers (2024-06-24T17:52:16Z) - The Instinctive Bias: Spurious Images lead to Illusion in MLLMs [34.91795817316696]
We identify a typical class of inputs that baffles MLLMs, which consist of images that are highly relevant but inconsistent with answers.
We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images.
We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z) - Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [80.54979242912944]
This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
arXiv Detail & Related papers (2024-01-19T07:10:13Z) - Investigating the Catastrophic Forgetting in Multimodal Large Language Models [43.89009178021342]
We introduce EMT (Evaluating MulTimodality), a framework for evaluating catastrophic forgetting in MLLMs.
Almost all evaluated MLLMs fail to retain the same performance levels as their vision encoders on standard image classification tasks.
As fine-tuning proceeds, the MLLMs begin to hallucinate, resulting in a significant loss of generalizability.
arXiv Detail & Related papers (2023-09-19T04:51:13Z) - MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
A Multimodal Large Language Model (MLLM) relies on a powerful LLM to perform multimodal tasks.
This paper presents the first comprehensive MLLM Evaluation benchmark MME.
It measures both perception and cognition abilities on a total of 14 subtasks.
arXiv Detail & Related papers (2023-06-23T09:22:36Z)