TIER: Text-Image Entropy Regularization for CLIP-style models
- URL: http://arxiv.org/abs/2212.06710v1
- Date: Tue, 13 Dec 2022 16:29:13 GMT
- Title: TIER: Text-Image Entropy Regularization for CLIP-style models
- Authors: Anil Palepu, Andrew L. Beam
- Abstract summary: In CLIP-style models, text-token embeddings should have high similarity to only a small number of image-patch embeddings.
We formalize this observation using a novel regularization scheme that penalizes the entropy of the text-token to image-patch similarity scores.
We show that the proposed regularization scheme shrinks the text-token and image-patch similarity scores towards zero, thus achieving the desired effect.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the effect of a novel regularization scheme on
contrastive language-image pre-trained (CLIP) models. Our approach is based on
the observation that, in many domains, text tokens should only describe a small
number of image regions and, likewise, each image region should correspond to
only a few text tokens. In CLIP-style models, this implies that text-token
embeddings should have high similarity to only a small number of image-patch
embeddings for a given image-text pair. We formalize this observation using a
novel regularization scheme that penalizes the entropy of the text-token to
image-patch similarity scores. We qualitatively and quantitatively demonstrate
that the proposed regularization scheme shrinks the text-token and image-patch
similarity scores towards zero, thus achieving the desired effect. We
demonstrate the promise of our approach in an important medical context where
this underlying hypothesis naturally arises. Using our proposed approach, we
achieve state-of-the-art (SOTA) zero-shot performance on all tasks from the
CheXpert chest x-ray dataset, outperforming an unregularized version of the
model and several recently published self-supervised models.
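To make the regularization concrete, the following is a minimal PyTorch sketch of an entropy penalty on the text-token to image-patch similarity scores described above. The softmax-over-patches normalization, the temperature, and the weighting coefficient lambda_tier are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def tier_entropy_penalty(text_tokens, image_patches, temperature=0.07):
    # Sketch only: the normalization over patches and the temperature are assumptions.
    # text_tokens:   (B, T, D) text-token embeddings for B image-text pairs
    # image_patches: (B, P, D) image-patch embeddings
    text_tokens = F.normalize(text_tokens, dim=-1)
    image_patches = F.normalize(image_patches, dim=-1)

    # Token-to-patch cosine similarities, shape (B, T, P).
    sims = torch.einsum("btd,bpd->btp", text_tokens, image_patches)

    # Treat each text token's similarities as a distribution over image patches.
    probs = F.softmax(sims / temperature, dim=-1)

    # Shannon entropy per token; low entropy means each token matches only a few patches.
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=-1)
    return entropy.mean()

# Example usage (lambda_tier is a hypothetical weighting hyperparameter):
# loss = clip_contrastive_loss + lambda_tier * tier_entropy_penalty(text_tokens, image_patches)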
Related papers
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
- ITI-GEN: Inclusive Text-to-Image Generation [56.72212367905351]
This study investigates inclusive text-to-image generative models that generate images based on human-written prompts.
We show that, for some attributes, images can represent concepts more expressively than text.
We propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration.
arXiv Detail & Related papers (2023-09-11T15:54:30Z)
- Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback [20.78162037954646]
We introduce a decompositional approach towards evaluation and improvement of text-to-image alignment.
Human user studies indicate that the proposed approach surpasses the previous state of the art by 8.7% in overall text-to-image alignment accuracy.
arXiv Detail & Related papers (2023-07-10T17:54:57Z)
- Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models [52.29800567587504]
We propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information.
TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts.
We validate the efficacy of TCTS combined with Frequency Adaptive Sampling (FAS) with various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality.
arXiv Detail & Related papers (2023-04-04T03:52:49Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
STAIR significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- CyCLIP: Cyclic Contrastive Language-Image Pretraining [34.588147979731374]
Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness.
We demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions.
We propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes for the learned representations to be geometrically consistent in the image and text space.
arXiv Detail & Related papers (2022-05-28T15:31:17Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.