Adapting CLIP For Phrase Localization Without Further Training
- URL: http://arxiv.org/abs/2204.03647v1
- Date: Thu, 7 Apr 2022 17:59:38 GMT
- Title: Adapting CLIP For Phrase Localization Without Further Training
- Authors: Jiahao Li, Greg Shakhnarovich, Raymond A. Yeh
- Abstract summary: We propose to leverage contrastive language-vision models, CLIP, pre-trained on image and caption pairs.
We adapt CLIP to generate high-resolution spatial feature maps.
Our method for phrase localization requires no human annotations or additional training.
- Score: 30.467802103692378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised or weakly supervised methods for phrase localization (textual
grounding) either rely on human annotations or some other supervised models,
e.g., object detectors. Obtaining these annotations is labor-intensive and may
be difficult to scale in practice. We propose to leverage recent advances in
contrastive language-vision models, CLIP, pre-trained on image and caption
pairs collected from the internet. In its original form, CLIP only outputs an
image-level embedding without any spatial resolution. We adapt CLIP to generate
high-resolution spatial feature maps. Importantly, we can extract feature maps
from both ViT and ResNet CLIP models while maintaining the semantic properties
of an image embedding. This provides a natural framework for phrase
localization. Our method for phrase localization requires no human annotations
or additional training. Extensive experiments show that our method outperforms
existing no-training methods in zero-shot phrase localization, and in some
cases, it even outperforms supervised methods. Code is available at
https://github.com/pals-ttic/adapting-CLIP .
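As a rough illustration of how frozen CLIP features can drive zero-shot phrase localization, the sketch below scores ViT patch tokens against a phrase embedding using the OpenAI `clip` package. It is a minimal sketch under stated assumptions, not the authors' adaptation (see https://github.com/pals-ttic/adapting-CLIP for their method); in particular, projecting patch tokens through ln_post/proj is an approximation, since CLIP was trained to project only the class token.

```python
# Minimal sketch: per-patch CLIP ViT features for zero-shot phrase localization.
# NOT the paper's exact adaptation; the patch-token pooling/projection choices
# below are assumptions. Requires: torch, Pillow, and the OpenAI `clip` package.
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def patch_features(image_tensor):
    """Return L2-normalized per-patch embeddings of shape (grid_h*grid_w, dim)."""
    v = model.visual
    x = v.conv1(image_tensor.type(model.dtype))          # (1, width, gh, gw)
    gh, gw = x.shape[-2:]
    x = x.flatten(2).permute(0, 2, 1)                    # (1, gh*gw, width)
    cls = v.class_embedding.to(x.dtype).expand(x.shape[0], 1, -1)
    x = torch.cat([cls, x], dim=1) + v.positional_embedding.to(x.dtype)
    x = v.ln_pre(x).permute(1, 0, 2)                     # tokens-first for the transformer
    x = v.transformer(x).permute(1, 0, 2)                # (1, 1 + gh*gw, width)
    x = v.ln_post(x[:, 1:, :]) @ v.proj                  # project patch tokens (approximation)
    return F.normalize(x, dim=-1)[0], (gh, gw)

@torch.no_grad()
def localize(image_path, phrase):
    """Return a coarse (224, 224) heatmap of patch-phrase cosine similarity."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    feats, (gh, gw) = patch_features(image)
    text = F.normalize(model.encode_text(clip.tokenize([phrase]).to(device)), dim=-1)
    sim = (feats @ text.T).reshape(gh, gw)
    return F.interpolate(sim[None, None].float(), size=(224, 224),
                         mode="bilinear", align_corners=False)[0, 0]

# Usage (hypothetical file): heat = localize("dog.jpg", "a brown dog")
```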
Related papers
- Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations.
We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules.
CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
- CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation [31.264574799748903]
We propose an open-vocabulary semantic segmentation method, which does not require any annotations.
We show that the required self-supervised feature properties can be learnt directly from CLIP features.
Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference.
arXiv Detail & Related papers (2023-12-19T17:40:27Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
arXiv Detail & Related papers (2023-10-02T17:58:52Z)
- CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free [12.15899043709721]
We propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY.
It exploits CLIP's classification abilities on patches of different sizes and aggregates the decisions into a single map (see the sketch after this list).
We obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
arXiv Detail & Related papers (2023-09-25T16:52:59Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping [20.67415815472257]
We propose a zero-shot method from association to generation for image captioning and video captioning.
Knight achieves state-of-the-art performance among zero-shot methods for image captioning and video captioning.
arXiv Detail & Related papers (2023-04-26T04:06:20Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
arXiv Detail & Related papers (2021-11-18T14:49:15Z)
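The CLIP-DIY entry above describes classifying image patches of several sizes with an off-the-shelf CLIP model and aggregating the decisions into a single map. The sketch below illustrates that general idea with standard `clip` calls; the window sizes, strides, and simple averaging used here are illustrative assumptions, not the paper's procedure.

```python
# Hedged sketch of multi-scale window scoring with off-the-shelf CLIP,
# in the spirit of CLIP-DIY. Window sizes, strides, and averaging are
# illustrative assumptions. Requires: torch, Pillow, and the OpenAI `clip` package.
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def dense_score_map(image_path, phrase, window_fracs=(0.25, 0.5), stride_frac=0.125):
    """Average per-window phrase similarity into an (H, W) relevance map."""
    image = Image.open(image_path).convert("RGB")
    W, H = image.size
    text = F.normalize(model.encode_text(clip.tokenize([phrase]).to(device)), dim=-1)

    score = torch.zeros(H, W)
    count = torch.zeros(H, W)
    for frac in window_fracs:                        # one pass per window size
        win = int(min(H, W) * frac)
        stride = max(1, int(min(H, W) * stride_frac))
        for top in range(0, H - win + 1, stride):
            for left in range(0, W - win + 1, stride):
                crop = preprocess(image.crop((left, top, left + win, top + win)))
                feat = F.normalize(model.encode_image(crop.unsqueeze(0).to(device)), dim=-1)
                s = (feat @ text.T).item()           # window-phrase cosine similarity
                score[top:top + win, left:left + win] += s
                count[top:top + win, left:left + win] += 1
    return score / count.clamp(min=1)

# Usage (hypothetical file): heat = dense_score_map("street.jpg", "a red car")
```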