Adapting CLIP For Phrase Localization Without Further Training
- URL: http://arxiv.org/abs/2204.03647v1
- Date: Thu, 7 Apr 2022 17:59:38 GMT
- Title: Adapting CLIP For Phrase Localization Without Further Training
- Authors: Jiahao Li, Greg Shakhnarovich, Raymond A. Yeh
- Abstract summary: We propose to leverage contrastive language-vision models, CLIP, pre-trained on image and caption pairs.
We adapt CLIP to generate high-resolution spatial feature maps.
Our method for phrase localization requires no human annotations or additional training.
- Score: 30.467802103692378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised or weakly supervised methods for phrase localization (textual
grounding) either rely on human annotations or some other supervised models,
e.g., object detectors. Obtaining these annotations is labor-intensive and may
be difficult to scale in practice. We propose to leverage recent advances in
contrastive language-vision models, CLIP, pre-trained on image and caption
pairs collected from the internet. In its original form, CLIP only outputs an
image-level embedding without any spatial resolution. We adapt CLIP to generate
high-resolution spatial feature maps. Importantly, we can extract feature maps
from both ViT and ResNet CLIP models while maintaining the semantic properties
of an image embedding. This provides a natural framework for phrase
localization. Our method for phrase localization requires no human annotations
or additional training. Extensive experiments show that our method outperforms
existing no-training methods in zero-shot phrase localization, and in some
cases, it even outperforms supervised methods. Code is available at
https://github.com/pals-ttic/adapting-CLIP .
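As a rough illustration of how frozen CLIP features can drive zero-shot phrase localization, the sketch below scores ViT patch tokens against a phrase embedding using the OpenAI `clip` package. It is a minimal sketch under stated assumptions, not the authors' adaptation (see https://github.com/pals-ttic/adapting-CLIP for their method); in particular, projecting patch tokens through ln_post/proj is an approximation, since CLIP was trained to project only the class token.

```python
# Minimal sketch: per-patch CLIP ViT features for zero-shot phrase localization.
# NOT the paper's exact adaptation; the patch-token pooling/projection choices
# below are assumptions. Requires: torch, Pillow, and the OpenAI `clip` package.
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def patch_features(image_tensor):
    """Return L2-normalized per-patch embeddings of shape (grid_h*grid_w, dim)."""
    v = model.visual
    x = v.conv1(image_tensor.type(model.dtype))          # (1, width, gh, gw)
    gh, gw = x.shape[-2:]
    x = x.flatten(2).permute(0, 2, 1)                    # (1, gh*gw, width)
    cls = v.class_embedding.to(x.dtype).expand(x.shape[0], 1, -1)
    x = torch.cat([cls, x], dim=1) + v.positional_embedding.to(x.dtype)
    x = v.ln_pre(x).permute(1, 0, 2)                     # tokens-first for the transformer
    x = v.transformer(x).permute(1, 0, 2)                # (1, 1 + gh*gw, width)
    x = v.ln_post(x[:, 1:, :]) @ v.proj                  # project patch tokens (approximation)
    return F.normalize(x, dim=-1)[0], (gh, gw)

@torch.no_grad()
def localize(image_path, phrase):
    """Return a coarse (224, 224) heatmap of patch-phrase cosine similarity."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    feats, (gh, gw) = patch_features(image)
    text = F.normalize(model.encode_text(clip.tokenize([phrase]).to(device)), dim=-1)
    sim = (feats @ text.T).reshape(gh, gw)
    return F.interpolate(sim[None, None].float(), size=(224, 224),
                         mode="bilinear", align_corners=False)[0, 0]

# Usage (hypothetical file): heat = localize("dog.jpg", "a brown dog")
```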
Related papers
- Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations.
We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules.
CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
- CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation [31.264574799748903]
We propose an open-vocabulary semantic segmentation method, which does not require any annotations.
We show that the required self-supervised feature properties can be learnt directly from CLIP features.
Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference.
arXiv Detail & Related papers (2023-12-19T17:40:27Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense predictions tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [67.43527289422978]
We propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs.
We achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks.
arXiv Detail & Related papers (2023-10-02T17:58:52Z)
- CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free [12.15899043709721]
We propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY.
It exploits CLIP's classification abilities on patches of different sizes and aggregates the decisions into a single map (see the sketch after this list).
We obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
arXiv Detail & Related papers (2023-09-25T16:52:59Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping [20.67415815472257]
We propose a zero-shot method from association to generation for image captioning and video captioning.
Knight achieves state-of-the-art performance among zero-shot methods for image captioning and video captioning.
arXiv Detail & Related papers (2023-04-26T04:06:20Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- ClipCap: CLIP Prefix for Image Captioning [6.69087470775851]
We use the CLIP encoding as a prefix to the caption by employing a simple mapping network, and then fine-tune a language model to generate the image captions.
We demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets.
arXiv Detail & Related papers (2021-11-18T14:49:15Z)
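The CLIP-DIY entry above describes classifying image patches of several sizes with an off-the-shelf CLIP model and aggregating the decisions into a single map. The sketch below illustrates that general idea with standard `clip` calls; the window sizes, strides, and simple averaging used here are illustrative assumptions, not the paper's procedure.

```python
# Hedged sketch of multi-scale window scoring with off-the-shelf CLIP,
# in the spirit of CLIP-DIY. Window sizes, strides, and averaging are
# illustrative assumptions. Requires: torch, Pillow, and the OpenAI `clip` package.
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def dense_score_map(image_path, phrase, window_fracs=(0.25, 0.5), stride_frac=0.125):
    """Average per-window phrase similarity into an (H, W) relevance map."""
    image = Image.open(image_path).convert("RGB")
    W, H = image.size
    text = F.normalize(model.encode_text(clip.tokenize([phrase]).to(device)), dim=-1)

    score = torch.zeros(H, W)
    count = torch.zeros(H, W)
    for frac in window_fracs:                        # one pass per window size
        win = int(min(H, W) * frac)
        stride = max(1, int(min(H, W) * stride_frac))
        for top in range(0, H - win + 1, stride):
            for left in range(0, W - win + 1, stride):
                crop = preprocess(image.crop((left, top, left + win, top + win)))
                feat = F.normalize(model.encode_image(crop.unsqueeze(0).to(device)), dim=-1)
                s = (feat @ text.T).item()           # window-phrase cosine similarity
                score[top:top + win, left:left + win] += s
                count[top:top + win, left:left + win] += 1
    return score / count.clamp(min=1)

# Usage (hypothetical file): heat = dense_score_map("street.jpg", "a red car")
```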