Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models
- URL: http://arxiv.org/abs/2512.15254v1
- Date: Wed, 17 Dec 2025 09:56:25 GMT
- Title: Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models
- Authors: Kuinan Hou, Jing Mi, Marco Zorzi, Lamberto Ballan, Alberto Testolin
- Abstract summary: Multimodal vision-language models (VLMs) may offer a flexible alternative for open-set object counting. VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. However, none of the models can reliably count the number of objects in complex visual scenes.
- Score: 5.310444614342132
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Counting the number of items in a visual scene remains a fundamental yet challenging task in computer vision. Traditional approaches to solving this problem rely on domain-specific counting architectures, which are trained using datasets annotated with a predefined set of object categories. However, recent progress in creating large-scale multimodal vision-language models (VLMs) suggests that these domain-general architectures may offer a flexible alternative for open-set object counting. In this study, we therefore systematically compare the performance of state-of-the-art specialized counting architectures against VLMs on two popular counting datasets, as well as on a novel benchmark specifically created to have a finer-grained control over the visual properties of test images. Our findings show that most VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures. Notably, enumeration accuracy significantly improves when VLMs are prompted to generate intermediate representations (i.e., locations and verbal labels) of each object to be counted. Nevertheless, none of the models can reliably count the number of objects in complex visual scenes, showing that further research is still needed to create AI systems that can reliably deploy counting procedures in realistic environments.
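The abstract reports that enumeration accuracy improves when VLMs are prompted to produce intermediate representations (verbal labels and locations of each object) before answering. The sketch below illustrates what such a prompting strategy could look like in practice; it is not the authors' protocol, and the `query_vlm` helper, the prompt wording, and the parsing logic are all illustrative assumptions.

```python
import json
import re

# Hypothetical helper: send an image plus a text prompt to a multimodal model
# and return its text reply. Any real VLM client (a hosted API or a locally
# served open-weights model) could be plugged in here.
def query_vlm(image_path: str, prompt: str) -> str:
    raise NotImplementedError("plug in your VLM client of choice")

# Baseline: ask for the count in a single step.
DIRECT_PROMPT = "How many objects are in this image? Answer with a single number."

# Intermediate-representation prompting, in the spirit of the paper's finding:
# first make the model enumerate every object with a label and a location,
# then derive the count from that enumeration.
ENUMERATE_PROMPT = (
    "List every distinct object you can see in this image as a JSON array. "
    "Each entry must have a 'label' field (a short name) and a 'location' "
    "field (a brief description of where the object is). "
    "After the JSON array, write 'TOTAL: <number of entries>'."
)

def count_with_enumeration(image_path: str) -> int:
    """Query the VLM with the enumeration prompt and parse out a count."""
    reply = query_vlm(image_path, ENUMERATE_PROMPT)
    # Prefer the explicit TOTAL line; otherwise fall back to counting the
    # entries of the parsed JSON array.
    total = re.search(r"TOTAL:\s*(\d+)", reply)
    if total:
        return int(total.group(1))
    array = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if array:
        return len(json.loads(array.group(0)))
    raise ValueError("could not parse a count from the model reply")
```

In an evaluation loop, the count returned by `count_with_enumeration` would be compared against ground-truth annotations (e.g., via mean absolute error) and against the single-step `DIRECT_PROMPT` baseline.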
Related papers
- Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions [0.4934817254755008]
Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to determine how counting performance varies as image and prompt properties change. We implement attention-based interventions to focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions.
arXiv Detail & Related papers (2025-11-21T19:18:41Z)
- Understanding Counting Mechanisms in Large Language and Vision-Language Models [8.918147502104603]
We study how large language models (LLMs) and large vision-language models (LVLMs) represent and compute numerical information in counting tasks. Results show that individual tokens or visual features encode latent positional count information that can be extracted and transferred across contexts. In LVLMs, numerical information also appears in visual embeddings, shifting between background and foreground regions depending on spatial composition.
arXiv Detail & Related papers (2025-11-21T18:48:22Z)
- Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation [21.90583276089241]
PairTally is a benchmark dataset designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories. We show that despite recent advances, current models struggle to reliably count what users intend.
arXiv Detail & Related papers (2025-09-17T13:06:58Z)
- QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder. We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
arXiv Detail & Related papers (2025-04-03T18:47:16Z)
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities. We propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z)
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Interpreting the structure of multi-object representations in vision encoders [1.8749305679160366]
We evaluate vision encoders pre-trained on classification, large vision-language models, and self-supervised methods. We examine how object-wise representations are distributed across tokens and layers within these vision encoders. Our findings highlight significant differences in the representation of objects depending on their relevance to the pre-training objective.
arXiv Detail & Related papers (2024-06-13T12:54:20Z)
- BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation [57.40024206484446]
We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models.
BVS supports a large number of adjustable parameters at the scene level.
We showcase three example application scenarios.
arXiv Detail & Related papers (2024-05-15T17:57:56Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gain on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft)
arXiv Detail & Related papers (2020-03-31T12:22:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.