Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
- URL: http://arxiv.org/abs/2509.13939v1
- Date: Wed, 17 Sep 2025 13:06:58 GMT
- Title: Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
- Authors: Gia Khanh Nguyen, Yifeng Huang, Minh Hoai
- Abstract summary: PairTally is a benchmark dataset designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories. We show that despite recent advances, current models struggle to reliably count what users intend.
- Score: 21.90583276089241
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.
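To make the benchmark's evaluation protocol concrete, below is a minimal sketch of how a counting model could be scored on PairTally-style data using mean absolute error (MAE) and root mean squared error (RMSE), the standard metrics for counting benchmarks. The record fields and the `model` callable are hypothetical placeholders, not the paper's actual data format or API.

```python
# Minimal evaluation sketch for a PairTally-style benchmark.
# The CountingExample fields and the model interface are assumptions
# for illustration; the released dataset format may differ.
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List

@dataclass
class CountingExample:
    image_path: str  # high-resolution image containing two object categories
    prompt: str      # the category the user intends to count
    gt_count: int    # ground-truth count for the prompted category

def evaluate(model: Callable[[str, str], int],
             examples: List[CountingExample]) -> dict:
    """Score a model's predicted counts against ground truth with MAE/RMSE."""
    errors = [model(ex.image_path, ex.prompt) - ex.gt_count
              for ex in examples]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    return {"MAE": mae, "RMSE": rmse}

# Usage: metrics = evaluate(my_counting_model, pairtally_examples)
```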
Related papers
- Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models [5.310444614342132]
Multimodal vision-language models (VLMs) may offer a flexible alternative for open-set object counting.
VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures.
None of the models can reliably count the number of objects in complex visual scenes.
arXiv Detail & Related papers (2025-12-17T09:56:25Z)
- Object Counting with GPT-4o and GPT-5: A Comparative Study [2.624902795082451]
Zero-shot object counting attempts to estimate the number of object instances belonging to novel categories that the vision model performing the counting has never encountered during training.
Existing methods typically require a large amount of annotated data and often require visual exemplars to guide the counting process.
Large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of utilizing them for counting tasks without any supervision.
arXiv Detail & Related papers (2025-12-02T21:07:13Z)
- LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models [5.892066196730199]
Large vision-language models (LVLMs) are known to struggle with counting tasks.
We propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects.
We demonstrate the effectiveness of this approach across various datasets and benchmarks.
arXiv Detail & Related papers (2024-12-01T05:50:22Z)
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- Zero-Shot Object Counting with Language-Vision Models [50.1159882903028]
Class-agnostic object counting aims to count object instances of an arbitrary class at test time.
Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories.
We propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time.
arXiv Detail & Related papers (2023-09-22T14:48:42Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just a few anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
- Learning from Pseudo-labeled Segmentation for Multi-Class Object Counting [35.652092907690694]
Class-agnostic counting (CAC) has numerous potential applications across various domains.
The goal is to count objects of an arbitrary category during testing, based on only a few annotated exemplars.
We show that the segmentation model trained on these pseudo-labeled masks can effectively localize objects of interest for an arbitrary multi-class image.
arXiv Detail & Related papers (2023-07-15T01:33:19Z)
- Exploiting Category Names for Few-Shot Classification with Vision-Language Models [78.51975804319149]
Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks.
This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head.
arXiv Detail & Related papers (2022-11-29T21:08:46Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Semantic Representation and Dependency Learning for Multi-Label Image Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide the model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
- A Few-Shot Sequential Approach for Object Counting [63.82757025821265]
We introduce a class attention mechanism that sequentially attends to objects in the image and extracts their relevant features.
The proposed technique is trained on point-level annotations and uses a novel loss function that disentangles class-dependent and class-agnostic aspects of the model.
We present our results on a variety of object-counting/detection datasets, including FSOD and MS COCO.
arXiv Detail & Related papers (2020-07-03T18:23:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.