ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
- URL: http://arxiv.org/abs/2204.05991v1
- Date: Tue, 12 Apr 2022 17:55:38 GMT
- Title: ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
- Authors: Sanjay Subramanian, Will Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach
- Abstract summary: Large-scale pre-trained models are useful for image classification across domains.
We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC.
- Score: 114.85628613911713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a referring expression comprehension (ReC) model for a new visual
domain requires collecting referring expressions, and potentially corresponding
bounding boxes, for images in the domain. While large-scale pre-trained models
are useful for image classification across domains, it remains unclear if they
can be applied in a zero-shot manner to more complex tasks like ReC. We present
ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a
state-of-the-art large-scale model, for ReC. Motivated by the close connection
between ReC and CLIP's contrastive pre-training objective, the first component
of ReCLIP is a region-scoring method that isolates object proposals via
cropping and blurring, and passes them to CLIP. However, through controlled
experiments on a synthetic dataset, we find that CLIP is largely incapable of
performing spatial reasoning off-the-shelf. Thus, the second component of
ReCLIP is a spatial relation resolver that handles several types of spatial
relations. We reduce the gap between zero-shot baselines from prior work and
supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game
imagery), ReCLIP's relative improvement over supervised ReC models trained on
real images is 8%.
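To make the region-scoring component concrete, here is a minimal sketch of the crop-and-blur idea using the open-source CLIP package; the box format, blur radius, and the simple averaging of the two isolated views are illustrative assumptions rather than the paper's exact procedure, and the spatial relation resolver is omitted.

import torch
import clip
from PIL import ImageFilter

model, preprocess = clip.load("ViT-B/32", device="cpu")

def score_proposals(image, boxes, expression):
    # For each proposal box, build two isolated views: a tight crop, and a copy of
    # the full image with everything outside the box blurred. CLIP then scores each
    # view against the referring expression; the highest-scoring box is returned.
    text = clip.tokenize([expression])
    base_blur = image.filter(ImageFilter.GaussianBlur(radius=10))
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        scores = []
        for (x1, y1, x2, y2) in boxes:            # assumed integer pixel coordinates
            crop = image.crop((x1, y1, x2, y2))
            blurred = base_blur.copy()
            blurred.paste(crop, (x1, y1))         # keep the proposal sharp, blur the rest
            views = torch.stack([preprocess(crop), preprocess(blurred)])
            feats = model.encode_image(views)
            feats /= feats.norm(dim=-1, keepdim=True)
            scores.append((feats @ text_feat.T).mean().item())
    return max(range(len(boxes)), key=lambda i: scores[i])  # index of the predicted box

ReCLIP itself combines the two isolation strategies into per-proposal scores and layers the spatial relation resolver on top; the sketch only shows how an off-the-shelf contrastive model can rank region proposals against a referring expression.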
Related papers
- FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval [10.26297663751352]
Few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality within a target domain.
Vision-language pretraining methods like CLIP have shown strong few-shot and zero-shot learning performance.
To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP.
arXiv Detail & Related papers (2024-11-26T14:12:14Z)
- ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference [32.852004564832455]
We re-investigate the architecture of CLIP, and identify residual connections as the primary source of noise that degrades segmentation quality.
We propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation.
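One way to read this decomposition, as a hedged sketch: split the final block's output into its residual, attention, and feed-forward terms and keep only the attention term for dense prediction (module names and sizes below are illustrative assumptions, not ClearCLIP's code).

import torch
import torch.nn as nn

class DecomposedLastBlock(nn.Module):
    # Standard pre-norm transformer block whose output is returned both in full
    # and decomposed, so dense inference can drop the residual (and FFN) terms.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        attn_term, _ = self.attn(h, h, h)                      # attention update
        residual_term = x                                      # residual path (the flagged noise source)
        ffn_term = self.ffn(self.norm2(residual_term + attn_term))
        full_output = residual_term + attn_term + ffn_term     # usual block output
        dense_output = attn_term                               # attention term only, for segmentation
        return full_output, dense_output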
arXiv Detail & Related papers (2024-07-17T09:52:20Z)
- Semantic Compositions Enhance Vision-Language Contrastive Learning [46.985865191341944]
We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining.
Our method, CLIP-C, fuses the captions of two examples and blends 50% of each image to form a new composite sample.
The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
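A minimal sketch of how such a composite training example could be built; the 50/50 alpha blend and the caption-joining template below are illustrative assumptions, not necessarily the paper's exact composition scheme.

from PIL import Image

def compose_sample(image_a, caption_a, image_b, caption_b):
    # Blend 50% of each image and fuse the two captions into one composite example.
    image_b = image_b.convert(image_a.mode).resize(image_a.size)
    composite_image = Image.blend(image_a, image_b, alpha=0.5)   # pixel-wise 50/50 mix
    composite_caption = f"{caption_a} and {caption_b}"           # hypothetical fusion template
    return composite_image, composite_caption

Training then treats (composite_image, composite_caption) as an ordinary positive pair in the contrastive batch.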
arXiv Detail & Related papers (2024-07-01T15:58:20Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
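As a rough sketch of what a light-weight fusion module on top of a frozen backbone can look like (the dimensions, head count, and use of a standard transformer encoder layer are assumptions for illustration, not RECO's actual architecture):

import torch
import torch.nn as nn

class RetrievalFusion(nn.Module):
    # Refine a frozen CLIP embedding by attending over cross-modal embeddings
    # retrieved from a memory bank; only this single layer is trained.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)

    def forward(self, query_emb, retrieved_embs):
        # query_emb: [B, dim] frozen CLIP embedding of the query image or text
        # retrieved_embs: [B, K, dim] embeddings of the K retrieved cross-modal neighbours
        tokens = torch.cat([query_emb.unsqueeze(1), retrieved_embs], dim=1)
        fused = self.fusion(tokens)
        return fused[:, 0]   # read the refined embedding off the query position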
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- Continual Contrastive Finetuning Improves Low-Resource Relation Extraction [34.76128090845668]
Relation extraction has been particularly challenging in low-resource scenarios and domains.
Recent literature has tackled low-resource RE by self-supervised learning.
We propose to pretrain and finetune the RE model using consistent objectives of contrastive learning.
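The summary does not spell out the objective, but a generic InfoNCE-style contrastive loss of the kind typically shared between such pretraining and finetuning stages looks like this (a sketch, not the paper's specific formulation):

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    # Pull each anchor embedding toward its paired positive and push it away
    # from the other in-batch examples; anchor and positive are [B, dim].
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)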
arXiv Detail & Related papers (2022-12-21T07:30:22Z)
- Unsupervised Deep Learning Meets Chan-Vese Model [77.24463525356566]
We propose an unsupervised image segmentation approach that integrates the Chan-Vese (CV) model with deep neural networks.
Our basic idea is to apply a deep neural network that maps the image into a latent space to alleviate the violation of the piecewise constant assumption in image space.
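For reference, the piecewise-constant Chan-Vese energy being relaxed here, in its standard form (standard notation, not the paper's):

E(c_1, c_2, C) = \mu \,\mathrm{Length}(C) + \nu \,\mathrm{Area}(\mathrm{inside}(C)) + \lambda_1 \int_{\mathrm{inside}(C)} |I(x) - c_1|^2 \, dx + \lambda_2 \int_{\mathrm{outside}(C)} |I(x) - c_2|^2 \, dx

where I is the image, C the segmenting curve, and c_1, c_2 the mean intensities inside and outside C; mapping the image into a learned latent space makes the piecewise-constant assumption easier to satisfy.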
arXiv Detail & Related papers (2022-04-14T13:23:57Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- One-Shot Adaptation of GAN in Just One CLIP [51.188396199083336]
We present a novel single-shot GAN adaptation method through unified CLIP space manipulations.
Specifically, our model employs a two-step training strategy that begins with a reference image search in the source generator using CLIP-guided latent optimization.
We show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively.
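A hedged sketch of what CLIP-guided latent optimization for the reference-image search step can look like; the latent shape, learning rate, and the assumption that the generator emits CLIP-ready image tensors are placeholders, not the paper's setup.

import torch
import clip

def clip_guided_latent_search(generator, reference_image, steps=200, lr=0.05, device="cpu"):
    # Optimize a latent code so the generated image moves close to the reference
    # image in CLIP embedding space (cosine similarity).
    model, preprocess = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        ref = model.encode_image(preprocess(reference_image).unsqueeze(0).to(device))
        ref = ref / ref.norm(dim=-1, keepdim=True)
    w = torch.randn(1, generator.latent_dim, device=device, requires_grad=True)  # assumed latent-size attribute
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = generator(w)                      # assumed to return a CLIP-sized, normalized image tensor
        feat = model.encode_image(img)
        feat = feat / feat.norm(dim=-1, keepdim=True)
        loss = 1 - (feat * ref).sum()           # maximize cosine similarity to the reference
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()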
arXiv Detail & Related papers (2022-03-17T13:03:06Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information (including all of its content) and is not responsible for any consequences of its use.