Text-To-Concept (and Back) via Cross-Model Alignment
- URL: http://arxiv.org/abs/2305.06386v1
- Date: Wed, 10 May 2023 18:01:06 GMT
- Title: Text-To-Concept (and Back) via Cross-Model Alignment
- Authors: Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, Soheil Feizi
- Abstract summary: We show that the mapping from an image's representation in one model to its representation in another can be learned surprisingly well with just a linear layer.
We convert fixed off-the-shelf vision encoders to surprisingly strong zero-shot classifiers for free.
We show other immediate use-cases of text-to-concept, like building concept bottleneck models with no concept supervision.
- Score: 48.133333356834186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We observe that the mapping from an image's representation in one model to
its representation in another can be learned surprisingly well with just a
linear layer, even across diverse models. Building on this observation, we
propose $\textit{text-to-concept}$, where features from a fixed pretrained
model are aligned linearly to the CLIP space, so that text embeddings from
CLIP's text encoder become directly comparable to the aligned features. With
text-to-concept, we convert fixed off-the-shelf vision encoders to surprisingly
strong zero-shot classifiers for free, with accuracy at times even surpassing
that of CLIP, despite being much smaller models and trained on a small fraction
of the data compared to CLIP. We show other immediate use-cases of
text-to-concept, like building concept bottleneck models with no concept
supervision, diagnosing distribution shifts in terms of human concepts, and
retrieving images satisfying a set of text-based constraints. Lastly, we
demonstrate the feasibility of $\textit{concept-to-text}$, where vectors in a
model's feature space are decoded by first aligning them to the CLIP space before being
fed to a GPT-based generative model. Our work suggests that existing deep models,
with presumably diverse architectures and training, represent input samples
relatively similarly, and that two-way communication across model representation
spaces and with humans (through language) is viable.
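As a concrete illustration of the alignment step described above, the following is a minimal sketch, assuming features have already been extracted for a shared set of images with both the off-the-shelf encoder and CLIP's image encoder. A closed-form least-squares solve stands in for the linear layer, and all names, shapes, and the random stand-in tensors are illustrative assumptions rather than the authors' implementation.

```python
import torch

def fit_linear_alignment(feats_src, feats_clip):
    # Closed-form least-squares fit of W so that feats_src @ W approximates feats_clip.
    # (The paper fits a linear layer; lstsq is one simple stand-in for that training.)
    return torch.linalg.lstsq(feats_src, feats_clip).solution  # [d_src, d_clip]

def zero_shot_logits(feats_src, W, text_emb):
    # Cosine similarity between aligned image features and CLIP text embeddings.
    aligned = feats_src @ W
    aligned = aligned / aligned.norm(dim=-1, keepdim=True)
    text = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return aligned @ text.T  # [N, K]

# Random stand-ins; replace with real encoder outputs and CLIP embeddings.
feats_src = torch.randn(1000, 512)   # features from the fixed vision encoder
feats_clip = torch.randn(1000, 768)  # CLIP image features for the same images
text_emb = torch.randn(10, 768)      # CLIP text embeddings for 10 class prompts

W = fit_linear_alignment(feats_src, feats_clip)
preds = zero_shot_logits(feats_src, W, text_emb).argmax(dim=-1)  # one class index per image
```

Once W is fit, any text prompt embedded by CLIP's text encoder can be compared against the aligned features, which is what turns the fixed encoder into a zero-shot classifier; the concept-to-text direction additionally requires the GPT-based decoder and is not sketched here.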
Related papers
- Explain via Any Concept: Concept Bottleneck Model with Open Vocabulary Concepts [8.028021897214238]
"OpenCBM" is the first CBM with concepts of open vocabularies.
Our model significantly outperforms the previous state-of-the-art CBM by 9% in the classification accuracy on the benchmark dataset CUB-200-2011.
arXiv Detail & Related papers (2024-08-05T06:42:00Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different, relatively small open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts [22.74552390076515]
We probe the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones and pretraining sets.
We detect the presence of outlier features in robust zero-shot CLIP vision encoders, which, to the best of our knowledge, is the first time these have been observed in non-transformer models.
We find that the existence of outlier features is an indication of ImageNet shift robustness, since in our analysis we find them only in robust models.
arXiv Detail & Related papers (2023-10-19T17:59:12Z)
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones (a minimal sketch of the pixel-text scoring idea appears after this list).
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
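To make the pixel-text matching idea from the DenseCLIP summary above concrete, here is a minimal, hedged sketch of computing per-pixel score maps against text embeddings. It assumes dense visual features already projected into the same embedding space as the text embeddings; names and shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_feats, text_emb):
    # pixel_feats: [B, C, H, W] dense visual features (assumed to live in the
    # shared embedding space); text_emb: [K, C], one embedding per prompt/class.
    pixel = F.normalize(pixel_feats, dim=1)   # normalize each pixel's C-dim feature
    text = F.normalize(text_emb, dim=-1)
    return torch.einsum("bchw,kc->bkhw", pixel, text)  # [B, K, H, W] cosine scores

# Random stand-ins: a batch of 2 feature maps scored against 21 text prompts.
scores = pixel_text_score_maps(torch.randn(2, 512, 32, 32), torch.randn(21, 512))
print(scores.shape)  # torch.Size([2, 21, 32, 32])
```

Such score maps can then serve as dense supervision or guidance signals for a downstream segmentation or detection head.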
This list is automatically generated from the titles and abstracts of the papers on this site.