Object Counting with GPT-4o and GPT-5: A Comparative Study
- URL: http://arxiv.org/abs/2512.03233v1
- Date: Tue, 02 Dec 2025 21:07:13 GMT
- Title: Object Counting with GPT-4o and GPT-5: A Comparative Study
- Authors: Richard Füzesséry, Kaziwa Saleh, Sándor Szénási, Zoltán Vámossy
- Abstract summary: Zero-shot object counting attempts to estimate the number of object instances belonging to novel categories that the vision model performing the counting has never encountered during training. Existing methods typically require large amounts of annotated data and often need visual exemplars to guide the counting process. Large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of utilizing them for counting tasks without any supervision.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Zero-shot object counting attempts to estimate the number of object instances belonging to novel categories that the vision model performing the counting has never encountered during training. Existing methods typically require large amounts of annotated data and often need visual exemplars to guide the counting process. However, large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of utilizing them for counting tasks without any supervision. In this work, we aim to leverage the visual capabilities of two multi-modal LLMs, GPT-4o and GPT-5, to perform object counting in a zero-shot manner using only textual prompts. We evaluate both models on the FSC-147 and CARPK datasets and provide a comparative analysis. Our findings show that the models achieve performance comparable to state-of-the-art zero-shot approaches on FSC-147 and, in some cases, even surpass them.
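The pipeline the abstract describes can be sketched in a few lines: build a text-only prompt from the class name, extract an integer count from the model's free-form reply, and score predictions with Mean Absolute Error (the standard metric on FSC-147 and CARPK). This is a minimal illustration, not the authors' actual setup: the prompt wording, the `parse_count` helper, and the use of MAE are assumptions, since the abstract does not give these details, and the model call itself is omitted.

```python
import re

def build_prompt(category: str) -> str:
    # Hypothetical prompt in the spirit of the paper's setup: the class
    # name is the only guidance given to the multi-modal model.
    return (
        f"Count the number of {category} instances in this image. "
        "Answer with a single integer."
    )

def parse_count(reply: str) -> int:
    # The model may wrap the number in prose; take the first integer found.
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no count found in reply: {reply!r}")
    return int(match.group())

def mae(predicted: list[int], ground_truth: list[int]) -> float:
    # Mean Absolute Error over a dataset of per-image counts.
    assert len(predicted) == len(ground_truth) and predicted
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / len(predicted)
```

In practice the prompt and image would be sent to a vision-capable chat endpoint, and `parse_count` applied to the returned text before aggregating MAE over the test split.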
Related papers
- Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting [59.37613121962146]
We propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs.
arXiv Detail & Related papers (2026-02-13T09:58:35Z) - Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation [21.90583276089241]
PairTally is a benchmark dataset designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories. We show that despite recent advances, current models struggle to reliably count what users intend.
arXiv Detail & Related papers (2025-09-17T13:06:58Z) - Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers [54.83459025465947]
Even the largest models struggle with compositional reasoning, generalization, fine-grained spatial and temporal reasoning, and counting.
Visual reasoning with large language models (LLMs) as controllers can, in principle, address these limitations by decomposing the task and solving subtasks by orchestrating a set of (visual) tools.
We present a framework that mitigates these issues by introducing spatially and temporally abstract routines and by leveraging a small number of labeled examples to automatically generate in-context examples.
arXiv Detail & Related papers (2024-01-03T20:48:47Z) - Revisiting Few-Shot Object Detection with Vision-Language Models [49.79495118650838]
We revisit the task of few-shot object detection (FSOD) in the context of recent foundational vision-language models (VLMs).
We propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data.
We discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community.
arXiv Detail & Related papers (2023-12-22T07:42:00Z) - One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z) - Zero-Shot Object Counting with Language-Vision Models [50.1159882903028]
Class-agnostic object counting aims to count object instances of an arbitrary class at test time.
Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories.
We propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time.
arXiv Detail & Related papers (2023-09-22T14:48:42Z) - Learning to Count without Annotations [23.506936202158013]
We propose UnCounTR, a model that can learn this task without requiring any manual annotations.
We construct "Self-Collages", images with various pasted objects as training samples, that provide a rich learning signal covering arbitrary object types and counts.
Our method builds on existing unsupervised representations and segmentation techniques to successfully demonstrate for the first time the ability of reference-based counting without manual supervision.
arXiv Detail & Related papers (2023-07-17T17:48:06Z) - CounTR: Transformer-based Generalised Visual Counting [94.54725247039441]
We develop a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars".
We conduct thorough ablation studies on the large-scale counting benchmark FSC-147, and demonstrate state-of-the-art performance in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2022-08-29T17:02:45Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - A Few-Shot Sequential Approach for Object Counting [63.82757025821265]
We introduce a class attention mechanism that sequentially attends to objects in the image and extracts their relevant features.
The proposed technique is trained on point-level annotations and uses a novel loss function that disentangles class-dependent and class-agnostic aspects of the model.
We present our results on a variety of object-counting/detection datasets, including FSOD and MS COCO.
arXiv Detail & Related papers (2020-07-03T18:23:39Z) - Objectness-Aware Few-Shot Semantic Segmentation [31.13009111054977]
We show how to increase overall model capacity to achieve improved performance.
We introduce objectness, which is class-agnostic and so not prone to overfitting.
Given only one annotated example of an unseen category, experiments show that our method outperforms state-of-the-art methods with respect to mIoU.
arXiv Detail & Related papers (2020-04-06T19:12:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.