Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
- URL: http://arxiv.org/abs/2510.04401v1
- Date: Mon, 06 Oct 2025 00:11:24 GMT
- Title: Your Vision-Language Model Can't Even Count to 20: Exposing the Failures of VLMs in Compositional Counting
- Authors: Xuyang Guo, Zekai Huang, Zhenmei Shi, Zhao Song, Jiahao Zhang,
- Abstract summary: Vision-Language Models (VLMs) have become a central focus of today's AI community.<n>This paper introduces a benchmark, VLMCountBench, designed under a minimalist setting with only basic geometric shapes.<n>We study the effects of simple properties such as color, size, and prompt refinement in a controlled ablation.
- Score: 16.47788191300739
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have become a central focus of today's AI community, owing to their impressive abilities gained from training on large-scale vision-language data from the Web. These models have demonstrated strong performance across diverse tasks, including image understanding, video understanding, complex visual reasoning, and embodied AI. Despite these noteworthy successes, a fundamental question remains: Can VLMs count objects correctly? In this paper, we introduce a simple yet effective benchmark, VLMCountBench, designed under a minimalist setting with only basic geometric shapes (e.g., triangles, circles) and their compositions, focusing exclusively on counting tasks without interference from other factors. We adopt strict independent variable control and systematically study the effects of simple properties such as color, size, and prompt refinement in a controlled ablation. Our empirical results reveal that while VLMs can count reliably when only one shape type is present, they exhibit substantial failures when multiple shape types are combined (i.e., compositional counting). This highlights a fundamental empirical limitation of current VLMs and motivates important directions for future research.
Related papers
- Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs [60.93949629734977]
We propose Visual Contrastive Self-Taught Reasoner (VC-STaR) to mitigate hallucinations in model-generated rationales.<n>We collect a diverse suite of VQA datasets, curate contrastive pairs according to multi-modal similarity, and generate rationales using VC-STaR.<n>Extensive experiments show that VC-STaR not only outperforms existing self-improving approaches but also surpasses models finetuned on the SoTA visual reasoning datasets.
arXiv Detail & Related papers (2026-03-03T03:18:31Z) - Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems.<n>We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z) - ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.<n>Models trained with the ViCrit Task exhibit substantial gains across a variety of vision-language models benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation [38.740533834549716]
We show that language-only models can achieve comparable or even better performance than MLLMs that consume raw visual inputs.<n>Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications.<n>Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning.
arXiv Detail & Related papers (2025-06-11T13:39:46Z) - COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training [49.2684130383925]
We propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training.<n>It integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework.<n>It consistently outperforms previous strong baselines on various zero-shot downstream tasks.
arXiv Detail & Related papers (2024-12-02T18:56:06Z) - Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z) - Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks.<n> MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.