CLIP-Count: Towards Text-Guided Zero-Shot Object Counting
- URL: http://arxiv.org/abs/2305.07304v2
- Date: Thu, 10 Aug 2023 04:04:37 GMT
- Title: CLIP-Count: Towards Text-Guided Zero-Shot Object Counting
- Authors: Ruixiang Jiang, Lingbo Liu, Changwen Chen
- Abstract summary: We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
Our method effectively generates high-quality density maps for objects-of-interest.
- Score: 32.07271723717184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in visual-language models have shown remarkable zero-shot
text-image matching ability that is transferable to downstream tasks such as
object detection and segmentation. Adapting these models for object counting,
however, remains a formidable challenge. In this study, we first investigate
transferring vision-language models (VLMs) for class-agnostic object counting.
Specifically, we propose CLIP-Count, the first end-to-end pipeline that
estimates density maps for open-vocabulary objects with text guidance in a
zero-shot manner. To align the text embedding with dense visual features, we
introduce a patch-text contrastive loss that guides the model to learn
informative patch-level visual representations for dense prediction. Moreover,
we design a hierarchical patch-text interaction module to propagate semantic
information across different resolution levels of visual features. Benefiting
from the full exploitation of the rich image-text alignment knowledge of
pretrained VLMs, our method effectively generates high-quality density maps for
objects-of-interest. Extensive experiments on FSC-147, CARPK, and ShanghaiTech
crowd counting datasets demonstrate state-of-the-art accuracy and
generalizability of the proposed method. Code is available:
https://github.com/songrise/CLIP-Count.
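To make the core idea concrete, below is a minimal, illustrative sketch of an InfoNCE-style patch-text contrastive objective of the kind the abstract describes. The function name, tensor shapes, and the `pos_mask` used to mark object patches are assumptions for illustration only, not the authors' implementation; the repository linked above is the authoritative reference.

```python
import torch
import torch.nn.functional as F

def patch_text_contrastive_loss(patch_embs, text_emb, pos_mask, temperature=0.07):
    """Illustrative InfoNCE-style patch-text contrastive loss (a sketch, not CLIP-Count's code).

    patch_embs: (N, D) embeddings of image patches from the visual encoder.
    text_emb:   (D,)   embedding of the text prompt from the text encoder.
    pos_mask:   (N,)   boolean, True for patches assumed to contain the prompted object.
    """
    patch_embs = F.normalize(patch_embs, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Patch-to-text cosine similarities, scaled by a temperature.
    logits = patch_embs @ text_emb / temperature          # (N,)

    # Softmax over all patches: positive patches should dominate the distribution.
    log_prob = logits - torch.logsumexp(logits, dim=0)

    # Average negative log-likelihood over the positive patches only.
    return -(log_prob[pos_mask]).mean()
```

In CLIP-Count, the positive patches would plausibly be derived from the ground-truth density map, but the exact formulation of the loss is defined in the released code rather than in this sketch.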
Related papers
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
- Enhancing Visual Representation for Text-based Person Searching [9.601697802095119]
VFE-TPS is a Visual Feature Enhanced Text-based Person Search model.
It introduces a pre-trained CLIP backbone to learn basic multimodal features.
It constructs a Text Guided Masked Image Modeling task to enhance the model's ability to learn local visual details.
arXiv Detail & Related papers (2024-12-30T01:38:14Z)
- LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation [16.864086165056698]
Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets.
We propose to alleviate the issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features.
Our method achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2024-11-30T05:49:42Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
arXiv Detail & Related papers (2023-07-04T06:54:01Z)
- OSIC: A New One-Stage Image Captioner Coined [38.46732302316068]
We propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning.
To obtain rich features, we use the Swin Transformer to calculate multi-level features.
To enhance the encoder's global modeling for captioning, we propose a new dual-dimensional refining module.
arXiv Detail & Related papers (2022-11-04T08:50:09Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
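The DenseCLIP entry above turns CLIP's image-text matching into pixel-text matching and uses the resulting score maps to guide dense prediction, which is closely related to the patch-level alignment used in CLIP-Count. The snippet below is only a rough sketch of how such a pixel-text score map could be computed from CLIP-style features; the shapes, names, and temperature value are illustrative assumptions, not DenseCLIP's actual interface.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_map(pixel_feats, text_embs, temperature=0.07):
    """Illustrative pixel-text score map in the spirit of DenseCLIP (a sketch, not its API).

    pixel_feats: (B, D, H, W) dense visual features from an image encoder.
    text_embs:   (K, D)       class-name embeddings from a text encoder.
    Returns:     (B, K, H, W) per-pixel, per-class matching scores.
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_embs = F.normalize(text_embs, dim=-1)

    # Cosine similarity between every pixel feature and every class embedding.
    return torch.einsum("bdhw,kd->bkhw", pixel_feats, text_embs) / temperature
```

A dense-prediction head could then consume such score maps as extra guidance channels or supervise them against ground-truth labels; the details depend on the specific pipeline.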
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.