[De|Re]constructing VLMs' Reasoning in Counting
- URL: http://arxiv.org/abs/2510.19555v1
- Date: Wed, 22 Oct 2025 13:08:47 GMT
- Title: [De|Re]constructing VLMs' Reasoning in Counting
- Authors: Simone Alghisi, Gabriel Roccabruna, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi
- Abstract summary: We study the reasoning skills of seven state-of-the-art Vision-Language Models (VLMs) in the counting task under controlled experimental conditions. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%.
- Score: 2.1856941852799134
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.
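The abstract's remedy, fine-tuning only the output layer that maps last-layer representations into the vocabulary space, can be made concrete with a short sketch. The snippet below is a minimal illustration and not the authors' released code: it assumes a Hugging Face LLaVA-style checkpoint, and the model id, prompt format, and toy counting example are assumptions made purely for illustration.

```python
# Minimal sketch of output-layer-only fine-tuning for a VLM (illustrative,
# not the authors' code). The checkpoint, prompt, and single example below
# are assumptions; a real run would iterate over a counting dataset.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed LLaVA-style checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16  # bf16 to reduce memory; optional
)

# Freeze everything, then unfreeze only the output projection (lm_head),
# i.e. the mapping from last-layer representations into the output space.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# Toy counting example: blank image plus a prompt ending with the answer.
image = Image.new("RGB", (336, 336), color="white")
text = "USER: <image>\nHow many circles are in the image? ASSISTANT: 3"
inputs = processor(images=image, text=text, return_tensors="pt")

# Standard causal-LM supervision; ignore image placeholder positions.
labels = inputs["input_ids"].clone()
labels[labels == model.config.image_token_index] = -100

model.train()
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # gradients reach only the output layer
optimizer.step()
optimizer.zero_grad()
```

In the paper's setup, this kind of targeted update to the output mapping is what reportedly yields up to a 21% accuracy improvement on counting; reproducing that result would additionally require the authors' controlled stimuli and evaluation protocol.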
Related papers
- Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z) - CIVET: Systematic Evaluation of Understanding in VLMs [1.760351767085873]
Vision-Language Models (VLMs) have achieved competitive performance in various tasks. We study their capability regarding object properties and relations in a controlled and interpretable manner. We introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli.
arXiv Detail & Related papers (2025-06-05T15:27:16Z) - Caption This, Reason That: VLMs Caught in the Middle [3.4820139118440676]
Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. They still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. We analyze VLM performance along core cognitive axes: Perception, Attention, and Memory.
arXiv Detail & Related papers (2025-05-24T14:25:48Z) - Vision language models are unreliable at trivial spatial cognition [0.2902243522110345]
Vision language models (VLMs) are designed to extract relevant visuospatial information from images. We develop a benchmark dataset -- TableTest -- whose images depict 3D scenes of objects arranged on a table, and use it to evaluate state-of-the-art VLMs. Results show that performance can be degraded by minor variations of prompts that use equivalent descriptions.
arXiv Detail & Related papers (2025-04-22T17:38:01Z) - Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [13.768090541138571]
Vision Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text token embeddings, and attention is dominated by vision tokens and system prompts.
arXiv Detail & Related papers (2025-03-21T17:51:14Z) - Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [69.56484419619919]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs [4.34964016971127]
iVISPAR is an interactive benchmark designed to evaluate the spatial reasoning capabilities of Vision-Language Models (VLMs) acting as agents. The benchmark supports visual 3D, 2D, and text-based input modalities, enabling comprehensive assessments of VLMs' planning and reasoning skills. Results indicate that while VLMs perform better on 2D tasks compared to 3D or text-based settings, they struggle with complex spatial configurations and consistently fall short of human performance.
arXiv Detail & Related papers (2025-02-05T14:29:01Z) - Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data. We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased prediction under the open vocabulary detection context.
Our observations lead to a simple yet effective paradigm, termed MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art approaches by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z) - VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the spatial planning capability in these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z) - GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.