Expanding Zero-Shot Object Counting with Rich Prompts
- URL: http://arxiv.org/abs/2505.15398v2
- Date: Mon, 26 May 2025 05:35:36 GMT
- Title: Expanding Zero-Shot Object Counting with Rich Prompts
- Authors: Huilin Zhu, Senyao Li, Jingling Yuan, Zhengwei Yang, Yu Guo, Wenxuan Liu, Xian Zhong, Shengfeng He
- Abstract summary: RichCount is a framework built on a two-stage training strategy that enhances text encoding and strengthens the model's association with objects in images. RichCount achieves state-of-the-art performance in zero-shot counting and significantly enhances generalization to unseen categories in open-world scenarios.
- Score: 34.63381285520037
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Expanding pre-trained zero-shot counting models to handle unseen categories requires more than simply adding new prompts, as this approach does not achieve the necessary alignment between text and visual features for accurate counting. We introduce RichCount, the first framework to address these limitations, employing a two-stage training strategy that enhances text encoding and strengthens the model's association with objects in images. RichCount improves zero-shot counting for unseen categories through two key objectives: (1) enriching text features with a feed-forward network and adapter trained on text-image similarity, thereby creating robust, aligned representations; and (2) applying this refined encoder to counting tasks, enabling effective generalization across diverse prompts and complex images. In this manner, RichCount goes beyond simple prompt expansion to establish meaningful feature alignment that supports accurate counting across novel categories. Extensive experiments on three benchmark datasets demonstrate the effectiveness of RichCount, achieving state-of-the-art performance in zero-shot counting and significantly enhancing generalization to unseen categories in open-world scenarios.
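The abstract describes the enrichment stage only at a high level. Below is a minimal sketch, assuming a frozen CLIP-style text encoder, of a feed-forward network plus adapter trained on text-image similarity; the module names, dimensions, and the InfoNCE-style loss are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of text-feature enrichment trained on text-image similarity.
# All architectural choices here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEnrichmentAdapter(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        # Feed-forward network that refines the raw (frozen) text embedding.
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        # Lightweight adapter with a residual path, so the enriched feature
        # stays close to the original text representation.
        self.adapter = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, dim)
        )

    def forward(self, text_feat: torch.Tensor) -> torch.Tensor:
        x = text_feat + self.ffn(text_feat)
        return x + self.adapter(x)

def similarity_loss(text_feat, image_feat, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss over text-image similarity (assumed form)."""
    t = F.normalize(text_feat, dim=-1)
    v = F.normalize(image_feat, dim=-1)
    logits = t @ v.t() / temperature
    targets = torch.arange(len(t), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this sketch only the adapter parameters would be trained in the first stage; the refined encoder would then be reused for the counting task, mirroring the two objectives listed in the abstract.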
Related papers
- SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z) - Generative Compositor for Few-Shot Visual Information Extraction [60.663887314625164]
We propose a novel generative model, named Generative Compositor, to address the challenge of few-shot VIE. The Generative Compositor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text. The proposed method achieves highly competitive results in full-sample training and notably outperforms the baseline in the 1-shot, 5-shot, and 10-shot settings.
arXiv Detail & Related papers (2025-03-21T04:56:24Z) - T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting [20.21019748095159]
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models.
arXiv Detail & Related papers (2025-02-28T01:09:18Z) - CountGD: Multi-Modal Open-World Counting [54.88804890463491]
This paper aims to improve the generality and accuracy of open-vocabulary object counting in images. We introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description or visual exemplars or both.
arXiv Detail & Related papers (2024-07-05T16:20:48Z) - M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z) - Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of specifying the categories to detect: language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z) - CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
Our method effectively generates high-quality density maps for objects of interest; a minimal sketch of such a patch-text loss appears after this list.
arXiv Detail & Related papers (2023-05-12T08:19:39Z) - Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
arXiv Detail & Related papers (2022-10-27T05:19:55Z) - CounTR: Transformer-based Generalised Visual Counting [94.54725247039441]
We develop a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of "exemplars".
We conduct thorough ablation studies on the large-scale counting benchmark, e.g., FSC-147, and demonstrate state-of-the-art performance in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2022-08-29T17:02:45Z)
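The patch-text contrastive loss mentioned in the CLIP-Count entry above can be illustrated with a short sketch. The per-patch positive/negative labelling from pooled ground-truth density, the threshold, and the binary cross-entropy form are assumptions chosen for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a patch-text contrastive loss in the spirit of CLIP-Count:
# patches with high ground-truth density act as positives for the text
# embedding, the rest as negatives. Threshold and loss form are assumptions.
import torch
import torch.nn.functional as F

def patch_text_contrastive_loss(patch_feats, text_feat, density,
                                thresh: float = 0.0, tau: float = 0.07):
    """patch_feats: [B, N, D] patch embeddings; text_feat: [B, D];
    density: [B, N] ground-truth density pooled per patch."""
    p = F.normalize(patch_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1).unsqueeze(1)   # [B, 1, D]
    sim = (p * t).sum(-1) / tau                        # [B, N] patch-text similarity
    pos = (density > thresh).float()                   # object patches as positives
    # Push object patches toward the text embedding, background patches away.
    return F.binary_cross_entropy_with_logits(sim, pos)
```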