Substance or Style: What Does Your Image Embedding Know?
- URL: http://arxiv.org/abs/2307.05610v1
- Date: Mon, 10 Jul 2023 22:40:10 GMT
- Title: Substance or Style: What Does Your Image Embedding Know?
- Authors: Cyrus Rashtchian and Charles Herrmann and Chun-Sung Ferng and Ayan
Chakrabarti and Dilip Krishnan and Deqing Sun and Da-Cheng Juan and Andrew
Tomkins
- Abstract summary: Image foundation models have primarily been evaluated for semantic content.
We measure the visual content of embeddings along many axes, including image style, quality, and a range of natural and artificial transformations.
We find that image-text models (CLIP and ALIGN) are better at recognizing new examples of style transfer than masking-based models (CAN and MAE).
- Score: 55.676463077772866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Probes are small networks that predict properties of underlying data from
embeddings, and they provide a targeted, effective way to illuminate the
information contained in embeddings. While analysis through the use of probes
has become standard in NLP, there has been much less exploration in vision.
Image foundation models have primarily been evaluated for semantic content.
Better understanding the non-semantic information in popular embeddings (e.g.,
MAE, SimCLR, or CLIP) will shed new light both on the training algorithms and
on the uses for these foundation models. We design a systematic transformation
prediction task and measure the visual content of embeddings along many axes,
including image style, quality, and a range of natural and artificial
transformations. Surprisingly, six embeddings (including SimCLR) encode enough
non-semantic information to identify dozens of transformations. We also
consider a generalization task, where we group similar transformations and hold
out several for testing. We find that image-text models (CLIP and ALIGN) are
better at recognizing new examples of style transfer than masking-based models
(CAN and MAE). Overall, our results suggest that the choice of pre-training
algorithm impacts the types of information in the embedding, and certain models
are better than others for non-semantic downstream tasks.
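To make the probing setup concrete, here is a minimal PyTorch sketch of a transformation-prediction probe: a small network trained on frozen image embeddings to classify which transformation was applied. The embedding dimension, probe architecture, number of transformation classes, and training loop are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransformationProbe(nn.Module):
    """Small MLP that predicts which transformation produced an image,
    given only its frozen embedding."""
    def __init__(self, embed_dim: int, num_transforms: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_transforms),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.net(embeddings)  # logits over transformation classes

# Toy training step on random data to show the intended usage; in practice the
# embeddings come from a frozen encoder (e.g., MAE, SimCLR, or CLIP).
probe = TransformationProbe(embed_dim=768, num_transforms=30)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

embeddings = torch.randn(64, 768)        # embeddings of transformed images
labels = torch.randint(0, 30, (64,))     # index of the transformation applied

optimizer.zero_grad()
loss = criterion(probe(embeddings), labels)
loss.backward()
optimizer.step()
```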
Related papers
- Exploring Simple Open-Vocabulary Semantic Segmentation [7.245983878396646]
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts.
In this paper, we introduce S-Seg, a novel model that achieves surprisingly strong performance without depending on the components that prior methods typically require.
arXiv Detail & Related papers (2024-01-22T18:59:29Z)
- Mixture of Self-Supervised Learning [2.191505742658975]
Self-supervised learning works by using a pretext task which will be trained on the model before being applied to a specific task.
Previous studies have only used one type of transformation as a pretext task.
This raises the question of how performance is affected when more than one pretext task is used and a gating network combines them.
arXiv Detail & Related papers (2023-07-27T14:38:32Z)
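As a rough illustration of combining several pretext tasks with a gating network, the sketch below weights per-task losses with a learned softmax gate. The two pretext tasks (rotation and jigsaw), the dimensions, and the way the losses are combined are hypothetical choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class GatedPretextModel(nn.Module):
    """Shared features feed several pretext-task heads; a softmax gate
    produces per-task weights used to combine the task losses."""
    def __init__(self, feat_dim, task_output_dims):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(feat_dim, d) for d in task_output_dims)
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, len(task_output_dims)),
            nn.Softmax(dim=-1),
        )

    def forward(self, features):
        weights = self.gate(features)                      # (batch, num_tasks)
        outputs = [head(features) for head in self.heads]  # one prediction per task
        return outputs, weights

# Two hypothetical pretext tasks: 4-way rotation and 8-way jigsaw permutation.
model = GatedPretextModel(feat_dim=256, task_output_dims=[4, 8])
features = torch.randn(16, 256)                # stand-in for encoder output
rotation_labels = torch.randint(0, 4, (16,))
jigsaw_labels = torch.randint(0, 8, (16,))

outputs, weights = model(features)
losses = torch.stack([
    nn.functional.cross_entropy(outputs[0], rotation_labels, reduction="none"),
    nn.functional.cross_entropy(outputs[1], jigsaw_labels, reduction="none"),
], dim=1)                                      # (batch, num_tasks)
total_loss = (weights * losses).sum(dim=1).mean()  # gate-weighted combination
total_loss.backward()
```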
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the heavy data demands of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
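The general idea of predicting codebook assignments for masked patches can be sketched schematically: a teacher assigns each patch feature to its nearest codebook entry, and the model is trained to predict those assignments for the masked patches. The tensors below are random stand-ins; this is not MOCA's actual single-stage online procedure.

```python
import torch
import torch.nn.functional as F

num_patches, feat_dim, codebook_size = 196, 256, 1024

# Teacher side: assign every patch feature to its nearest codebook entry.
codebook = F.normalize(torch.randn(codebook_size, feat_dim), dim=1)
teacher_feats = F.normalize(torch.randn(num_patches, feat_dim), dim=1)
targets = (teacher_feats @ codebook.T).argmax(dim=1)   # one code index per patch

# Student side: predict the code of each masked patch (the logits here are a
# stand-in for the output of a Vision Transformer prediction head).
mask = torch.rand(num_patches) < 0.6
student_logits = torch.randn(num_patches, codebook_size, requires_grad=True)

loss = F.cross_entropy(student_logits[mask], targets[mask])
loss.backward()
```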
- Effective Data Augmentation With Diffusion Models [65.09758931804478]
We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models.
Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples.
We evaluate our approach on few-shot image classification tasks and on a real-world weed recognition task, and observe an improvement in accuracy in the tested domains.
arXiv Detail & Related papers (2023-02-07T20:42:28Z)
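A generic version of this kind of augmentation can be assembled from an off-the-shelf image-to-image diffusion pipeline, as sketched below. The Hugging Face diffusers usage, the checkpoint name, and the prompt are assumed tooling choices for illustration; the paper's own method additionally adapts the model to new concepts from a few labelled examples.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def augment(image: Image.Image, prompt: str, strength: float = 0.5) -> Image.Image:
    """Edit an image toward the prompt while keeping its overall layout.
    Lower strength preserves more of the original image."""
    result = pipe(prompt=prompt, image=image, strength=strength, guidance_scale=7.5)
    return result.images[0]

# Hypothetical usage: generate a few semantic variants of one labelled example.
# original = Image.open("example.jpg").convert("RGB")
# variants = [augment(original, "a photo of a weed in a field") for _ in range(4)]
```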
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for building cropping algorithms that are robust and reflect user intent.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z)
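One simple way to realize text-conditioned cropping is to score candidate crops against the query with CLIP and keep the best match, as in the sketch below. The candidate boxes, the Hugging Face CLIP checkpoint, and the selection rule are illustrative assumptions, not ClipCrop's actual pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_crop(image, boxes, query):
    """Return the candidate box whose crop best matches the text query under CLIP."""
    crops = [image.crop(box) for box in boxes]
    inputs = processor(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = outputs.logits_per_text[0]   # similarity of the query to each crop
    return boxes[int(scores.argmax())]

# Hypothetical usage with candidate boxes in (left, upper, right, lower) format:
# image = Image.open("photo.jpg").convert("RGB")
# box = best_crop(image, [(0, 0, 200, 200), (100, 50, 400, 350)], "a person riding a bike")
```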
- Survey on Self-supervised Representation Learning Using Image Transformations [0.8098097078441623]
Self-supervised learning (SSL) is a technique used in unsupervised representation learning.
Geometric transformations have been shown to be powerful supervisory signals in SSL.
We shortlist six representative models that use image transformations including those based on predicting and autoencoding transformations.
Our analysis indicates that AETv2 performs best in most settings.
arXiv Detail & Related papers (2022-02-17T08:37:50Z)
- TransformNet: Self-supervised representation learning through predicting geometric transformations [0.8098097078441623]
We describe an unsupervised semantic feature learning approach for recognizing the geometric transformation applied to the input data.
The basic idea is that without recognizing the objects in an image, one cannot reliably predict the geometric transformation that was applied to it.
arXiv Detail & Related papers (2022-02-08T22:41:01Z)
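The classic instance of this idea is rotation prediction: rotate each image by a multiple of 90 degrees and train a network to recover which rotation was applied, as in the minimal sketch below. The toy encoder and random images are placeholders rather than TransformNet's architecture.

```python
import torch
import torch.nn as nn

def rotate_batch(images: torch.Tensor):
    """Rotate each image by a random multiple of 90 degrees; return images and labels."""
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, labels)]
    )
    return rotated, labels

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
classifier = nn.Linear(64, 4)               # 4 rotation classes

images = torch.randn(8, 3, 64, 64)          # stand-in for unlabelled images
rotated, labels = rotate_batch(images)
logits = classifier(encoder(rotated))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```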
- A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes [58.633364000258645]
We introduce RIVAL10, a dataset consisting of roughly 26k instances over 10 classes.
We evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds and attributes.
In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training).
arXiv Detail & Related papers (2022-01-26T06:31:28Z)
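A minimal version of such a sensitivity test corrupts only the foreground or only the background, given a binary object mask, and compares the resulting accuracy drops. The toy image, square mask, and noise level below are assumptions for illustration; loading real masks is omitted.

```python
import torch

def corrupt_region(image: torch.Tensor, mask: torch.Tensor,
                   region: str = "background", sigma: float = 0.2) -> torch.Tensor:
    """Add Gaussian noise to either the foreground (mask == 1) or background (mask == 0)."""
    noise = sigma * torch.randn_like(image)
    region_mask = mask if region == "foreground" else 1 - mask
    return (image + noise * region_mask).clamp(0, 1)

# Toy image and mask; in practice the mask would come from the dataset's segmentations.
image = torch.rand(3, 224, 224)
mask = torch.zeros(1, 224, 224)
mask[:, 60:160, 60:160] = 1.0               # pretend the object occupies this square

bg_corrupted = corrupt_region(image, mask, region="background")
fg_corrupted = corrupt_region(image, mask, region="foreground")
```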
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
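The pixel-text matching step can be pictured as a per-pixel cosine similarity between dense visual features and class text embeddings, yielding one score map per class. In the sketch below the features and text embeddings are random stand-ins and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def pixel_text_scores(dense_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """dense_feats: (C, H, W) per-pixel features; text_embeds: (K, C) class embeddings.
    Returns (K, H, W) cosine-similarity score maps."""
    C, H, W = dense_feats.shape
    feats = F.normalize(dense_feats.reshape(C, H * W), dim=0)   # normalize each pixel feature
    texts = F.normalize(text_embeds, dim=1)                     # normalize each class embedding
    scores = texts @ feats                                      # (K, H*W)
    return scores.reshape(-1, H, W)

dense_feats = torch.randn(512, 28, 28)   # e.g., image features before pooling
text_embeds = torch.randn(20, 512)       # e.g., embeddings of 20 class prompts
score_maps = pixel_text_scores(dense_feats, text_embeds)        # (20, 28, 28)
```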
This list is automatically generated from the titles and abstracts of the papers on this site.