Injecting Image Details into CLIP's Feature Space
- URL: http://arxiv.org/abs/2208.14649v4
- Date: Sun, 30 Jul 2023 13:35:19 GMT
- Title: Injecting Image Details into CLIP's Feature Space
- Authors: Zilun Zhang, Cuifeng Shen, Yuan Shen, Huixin Xiong, Xinyu Zhou
- Abstract summary: We introduce an efficient framework that can produce a single feature representation for a high-resolution image.
In the framework, we train a feature fusing model on CLIP features extracted with a carefully designed image patching method.
We validate our framework by retrieving images from class-prompted queries on real-world and synthetic datasets.
- Score: 29.450159407113155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although CLIP-like Visual Language Models provide a functional joint feature
space for image and text, the limited image input size of CLIP-like models
(e.g., 224) means that subtle details are lost in the feature representation
when we input high-resolution images (e.g., 2240). In this work, we introduce an
efficient framework that can produce a single feature representation for a
high-resolution image that injects image details and shares the same semantic
space as the original CLIP. In the framework, we train a feature fusing model
based on CLIP features extracted with a carefully designed image patching
method that can cover objects of any scale, weakly supervised by image-agnostic
class-prompted queries. We validate our framework by retrieving images from
class-prompted queries on real-world and synthetic datasets, showing significant
performance improvement on these tasks. Furthermore, to fully demonstrate our
framework's detail retrieval ability, we construct a CLEVR-like synthetic
dataset called CLEVR-DS, which is fully annotated and has a controllable object
scale.
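To make the patch-and-fuse idea concrete, below is a minimal sketch (not the authors' released code): it tiles a high-resolution image into multi-scale grid patches, encodes each patch with a frozen OpenAI CLIP image encoder, and pools the patch features into a single vector that stays comparable to CLIP text features. The grid tiling and the AttentionFuser are simplifications standing in for the paper's carefully designed patching scheme and its trained, weakly supervised fusing model; the file name and the text prompt are placeholders.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image

import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # 224x224 input resolution
model.eval()


def multi_scale_patches(img, scales=(1, 2, 4)):
    """Cut the image into 1x1, 2x2 and 4x4 grids so that objects of different
    sizes are covered by at least one patch (a simplification of the paper's
    patching method)."""
    w, h = img.size
    patches = []
    for s in scales:
        pw, ph = w // s, h // s
        for i in range(s):
            for j in range(s):
                patches.append(img.crop((i * pw, j * ph, (i + 1) * pw, (j + 1) * ph)))
    return patches


@torch.no_grad()
def encode_patches(img):
    """Encode every patch with the frozen CLIP image encoder."""
    batch = torch.stack([preprocess(p) for p in multi_scale_patches(img)]).to(device)
    feats = model.encode_image(batch).float()  # (num_patches, 512) for ViT-B/32
    return F.normalize(feats, dim=-1)


class AttentionFuser(nn.Module):
    """Toy stand-in for the trained fusing model: attention-pool the patch
    features into one vector and renormalize so it remains comparable to
    CLIP text features."""

    def __init__(self, dim=512):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim) / dim ** 0.5)

    def forward(self, patch_feats):
        attn = torch.softmax(patch_feats @ self.query, dim=0)  # (num_patches,)
        fused = (attn.unsqueeze(-1) * patch_feats).sum(dim=0)  # (dim,)
        return F.normalize(fused, dim=-1)


# Usage: score the fused high-resolution image feature against a class-prompted
# text query, just as one would with a plain CLIP image feature.
fuser = AttentionFuser().to(device)
with torch.no_grad():
    image_feat = fuser(encode_patches(Image.open("high_res_example.jpg")))
    text_feat = F.normalize(
        model.encode_text(clip.tokenize(["a photo of a small red cube"]).to(device)).float(),
        dim=-1,
    )
    similarity = (image_feat @ text_feat.T).item()
```
In the paper, the fusing model is trained so that the fused feature lands in the same semantic space as the original CLIP; the random attention query above would of course need such training before the similarity score is meaningful.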
Related papers
- DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks [31.850184662606562]
We introduce DetailCLIP: A Detail-Oriented CLIP to address the limitations of contrastive learning-based vision-language models.
We show that DetailCLIP surpasses existing CLIP-based and traditional self-supervised learning (SSL) models in segmentation accuracy and exhibits superior generalization across diverse datasets.
arXiv Detail & Related papers (2024-09-10T18:27:36Z)
- Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Interpreting CLIP's Image Representation via Text-Based Decomposition [73.54377859089801]
We investigate the CLIP image encoder by analyzing how individual model components affect the final representation.
We decompose the image representation as a sum across individual image patches, model layers, and attention heads.
We use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter.
arXiv Detail & Related papers (2023-10-09T17:59:04Z)
- Zero-Shot Visual Classification with Guided Cropping [9.321383320998262]
We propose using an off-the-shelf zero-shot object detection model in a preprocessing step to increase the focus of the zero-shot classifier on the object of interest.
We empirically show that our approach improves zero-shot classification results across architectures and datasets, particularly for small objects.
arXiv Detail & Related papers (2023-09-12T20:09:12Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features, integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- CLIP2GAN: Towards Bridging Text with the Latent Space of GANs [128.47600914674985]
We propose a novel framework, i.e., CLIP2GAN, by leveraging CLIP model and StyleGAN.
The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN.
arXiv Detail & Related papers (2022-11-28T04:07:17Z)
- OSIC: A New One-Stage Image Captioner Coined [38.46732302316068]
We propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning.
To obtain rich features, we use the Swin Transformer to calculate multi-level features.
To enhance the encoder's global modeling for captioning, we propose a new dual-dimensional refining module.
arXiv Detail & Related papers (2022-11-04T08:50:09Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.