Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
- URL: http://arxiv.org/abs/2509.13939v1
- Date: Wed, 17 Sep 2025 13:06:58 GMT
- Title: Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
- Authors: Gia Khanh Nguyen, Yifeng Huang, Minh Hoai
- Abstract summary: PairTally is a benchmark dataset designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories. We show that despite recent advances, current models struggle to reliably count what users intend.
- Score: 21.90583276089241
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.
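To make the benchmark's evaluation protocol concrete, below is a minimal sketch of how a counting model could be scored on PairTally-style data using mean absolute error (MAE) and root mean squared error (RMSE), the standard metrics for counting benchmarks. The record fields and the `model` callable are hypothetical placeholders, not the paper's actual data format or API.

```python
# Minimal evaluation sketch for a PairTally-style benchmark.
# The CountingExample fields and the model interface are assumptions
# for illustration; the released dataset format may differ.
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List

@dataclass
class CountingExample:
    image_path: str  # high-resolution image containing two object categories
    prompt: str      # the category the user intends to count
    gt_count: int    # ground-truth count for the prompted category

def evaluate(model: Callable[[str, str], int],
             examples: List[CountingExample]) -> dict:
    """Score a model's predicted counts against ground truth with MAE/RMSE."""
    errors = [model(ex.image_path, ex.prompt) - ex.gt_count
              for ex in examples]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = sqrt(sum(e * e for e in errors) / len(errors))
    return {"MAE": mae, "RMSE": rmse}

# Usage: metrics = evaluate(my_counting_model, pairtally_examples)
```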
Related papers
- Assessing the Visual Enumeration Abilities of Specialized Counting Architectures and Vision-Language Models [5.310444614342132]
Multimodal vision-language models (VLMs) may offer a flexible alternative for open-set object counting.
VLMs can approximately enumerate the number of items in a visual scene, matching or even surpassing the performance of specialized computer vision architectures.
None of the models can reliably count the number of objects in complex visual scenes.
arXiv Detail & Related papers (2025-12-17T09:56:25Z)
- Object Counting with GPT-4o and GPT-5: A Comparative Study [2.624902795082451]
Zero-shot object counting attempts to estimate the number of object instances belonging to novel categories that the vision model performing the counting has never encountered during training.
Existing methods typically require a large amount of annotated data and often require visual exemplars to guide the counting process.
Large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of utilizing them for counting tasks without any supervision.
arXiv Detail & Related papers (2025-12-02T21:07:13Z)
- LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models [5.892066196730199]
Large vision-language models (LVLMs) are known to struggle with counting tasks.
We propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects.
We demonstrate the effectiveness of this approach across various datasets and benchmarks.
arXiv Detail & Related papers (2024-12-01T05:50:22Z)
- Envisioning Class Entity Reasoning by Large Language Models for Few-shot Learning [13.68867780184022]
Few-shot learning aims to recognize new concepts using a limited number of visual samples.
Our framework incorporates both the abstract class semantics and the concrete class entities extracted from Large Language Models (LLMs).
For the challenging one-shot setting, our approach, utilizing the ResNet-12 backbone, achieves an average improvement of 1.95% over the second-best competitor.
arXiv Detail & Related papers (2024-08-22T15:10:20Z)
- Zero-Shot Object Counting with Language-Vision Models [50.1159882903028]
Class-agnostic object counting aims to count object instances of an arbitrary class at test time.
Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories.
We propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time.
arXiv Detail & Related papers (2023-09-22T14:48:42Z)
- Anchor Points: Benchmarking Models with Much Fewer Examples [88.02417913161356]
In six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models.
We propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset.
Just a few anchor points can be used to estimate model per-class predictions on all other points in a dataset with low mean absolute error.
arXiv Detail & Related papers (2023-09-14T17:45:51Z)
- Learning from Pseudo-labeled Segmentation for Multi-Class Object Counting [35.652092907690694]
Class-agnostic counting (CAC) has numerous potential applications across various domains.
The goal is to count objects of an arbitrary category during testing, based on only a few annotated exemplars.
We show that the segmentation model trained on these pseudo-labeled masks can effectively localize objects of interest for an arbitrary multi-class image.
arXiv Detail & Related papers (2023-07-15T01:33:19Z)
- Exploiting Category Names for Few-Shot Classification with Vision-Language Models [78.51975804319149]
Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks.
This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head.
arXiv Detail & Related papers (2022-11-29T21:08:46Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Semantic Representation and Dependency Learning for Multi-Label Image Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide the model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
- A Few-Shot Sequential Approach for Object Counting [63.82757025821265]
We introduce a class attention mechanism that sequentially attends to objects in the image and extracts their relevant features.
The proposed technique is trained on point-level annotations and uses a novel loss function that disentangles class-dependent and class-agnostic aspects of the model.
We present our results on a variety of object-counting/detection datasets, including FSOD and MS COCO.
arXiv Detail & Related papers (2020-07-03T18:23:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.