Related papers: CountGD++: Generalized Prompting for Open-World Counting

URL: http://arxiv.org/abs/2512.23351v1
Date: Mon, 29 Dec 2025 10:23:22 GMT
Title: CountGD++: Generalized Prompting for Open-World Counting
Authors: Niki Amini-Naieni, Andrew Zisserman
Abstract summary: We introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples. We also introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference.
Score: 54.61576076312857
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at https://github.com/niki-amini-naieni/CountGDPlusPlus.

Related papers

Expanding Zero-Shot Object Counting with Rich Prompts [34.63381285520037]
RichCount is a training strategy that enhances text encoding and strengthens the model's association with objects in images. RichCount achieves state-of-the-art performance in zero-shot counting and significantly enhances generalization to unseen categories in open-world scenarios.
arXiv Detail & Related papers (2025-05-21T11:38:23Z)
Introducing Visual Perception Token into Multimodal Large Language Model [53.82301522384719]
A Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. MLLMs still lack the autonomous capability to control their own visual perception processes. We propose the concept of Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes.
arXiv Detail & Related papers (2025-02-24T18:56:12Z)
CountGD: Multi-Modal Open-World Counting [54.88804890463491]
This paper aims to improve the generality and accuracy of open-vocabulary object counting in images. We introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description or visual exemplars or both.
arXiv Detail & Related papers (2024-07-05T16:20:48Z)
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring [26.14137626882127]
We introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. We design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models. Experiments demonstrate that Griffon v2 can localize objects of interest with visual and textual referring, achieve state-of-the-art performance on REC and phrase grounding, and outperform expert models in object detection, object counting, and REG.
arXiv Detail & Related papers (2024-03-14T12:21:37Z)
OmniCount: Multi-label Object Counting with Semantic-Geometric Priors [52.28092505350977]
This paper introduces a more practical approach enabling simultaneous counting of multiple object categories using an open-vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights (priors) from pre-trained models to count multiple categories of objects as specified by users. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions.
arXiv Detail & Related papers (2024-03-08T16:38:11Z)
AFreeCA: Annotation-Free Counting for All [17.581015609730017]
We introduce an unsupervised sorting methodology to learn object-related features that are subsequently refined and anchored for counting purposes. We also present a density classifier-guided method for dividing an image into patches containing objects that can be reliably counted.
arXiv Detail & Related papers (2024-03-07T23:18:34Z)
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts [38.59120110371588]
We introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings.
arXiv Detail & Related papers (2023-12-01T18:59:56Z)
Zero-Shot Object Counting with Language-Vision Models [50.1159882903028]
Class-agnostic object counting aims to count object instances of an arbitrary class at test time. Current methods require human-annotated exemplars as inputs which are often unavailable for novel categories. We propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time.
arXiv Detail & Related papers (2023-09-22T14:48:42Z)
Edit As You Wish: Video Caption Editing with Multi-grained User Control [61.76233268900959]
We propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained.
arXiv Detail & Related papers (2023-05-15T07:12:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.