Text encoders bottleneck compositionality in contrastive vision-language models
- URL: http://arxiv.org/abs/2305.14897v2
- Date: Mon, 30 Oct 2023 17:57:47 GMT
- Title: Text encoders bottleneck compositionality in contrastive vision-language models
- Authors: Amita Kamath, Jack Hessel, Kai-Wei Chang
- Abstract summary: We train text-only recovery probes that aim to reconstruct captions from single-vector text representations.
We find that CLIP's text encoder falls short on more compositional inputs.
Results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors.
- Score: 76.2406963762722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performant vision-language (VL) models like CLIP represent captions using a
single vector. How much information about language is lost in this bottleneck?
We first curate CompPrompts, a set of increasingly compositional image captions
that VL models should be able to capture (e.g., single object, to
object+property, to multiple interacting objects). Then, we train text-only
recovery probes that aim to reconstruct captions from single-vector text
representations produced by several VL models. This approach does not require
images, allowing us to test on a broader range of scenes compared to prior
work. We find that: 1) CLIP's text encoder falls short on more compositional
inputs, including object relationships, attribute-object association, counting,
and negations; 2) some text encoders work significantly better than others; and
3) text-only recovery performance predicts multi-modal matching performance on
ControlledImCaps: a new evaluation benchmark we collect and release consisting
of fine-grained compositional images and captions. Specifically, our results
suggest text-only recoverability is a necessary (but not sufficient) condition
for modeling compositional factors in contrastive VL models. We release our
datasets and code.
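As a rough illustration of the probing setup (not the released code), the sketch below encodes captions into single vectors with a frozen CLIP text encoder and trains a small decoder to reconstruct the tokens from that vector. The GRU decoder, checkpoint choice, and training details are illustrative assumptions, not the paper's exact probe.
```python
# Sketch of a text-only recovery probe: encode a caption into CLIP's single
# text vector, then train a small decoder to reconstruct the caption tokens.
# The GRU decoder below is an illustrative choice, not the paper's exact probe.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()  # frozen

class RecoveryProbe(nn.Module):
    """Decodes caption tokens from a single frozen text embedding."""
    def __init__(self, embed_dim=512, hidden=512, vocab=tokenizer.vocab_size):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, hidden)
        self.init_proj = nn.Linear(embed_dim, hidden)  # vector -> initial GRU state
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, text_vec, input_ids):
        h0 = torch.tanh(self.init_proj(text_vec)).unsqueeze(0)
        hidden_states, _ = self.gru(self.token_emb(input_ids), h0)
        return self.out(hidden_states)                 # logits over the vocabulary

captions = ["two dogs chasing one cat", "a red cube left of a blue sphere"]
batch = tokenizer(captions, padding=True, return_tensors="pt")
with torch.no_grad():
    text_vec = clip.get_text_features(**batch)         # (B, 512) bottleneck vectors

probe = RecoveryProbe()
logits = probe(text_vec, batch.input_ids[:, :-1])       # teacher forcing
loss = nn.functional.cross_entropy(                     # padding handling omitted
    logits.reshape(-1, logits.size(-1)), batch.input_ids[:, 1:].reshape(-1))
loss.backward()  # in practice: loop over CompPrompts with an optimizer
```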
Related papers
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that this approach suffers from multiple limitations including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
arXiv Detail & Related papers (2023-10-03T14:53:53Z)
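A minimal sketch of this kind of pruning signal, assuming BLIP as the clean-data captioner and a sentence transformer for text similarity (both are illustrative stand-ins, not necessarily the paper's exact models or threshold):
```python
# Hedged sketch of a Sieve-style pruning signal: caption each image with a model
# trained on clean image-text data, then score the pair by the text similarity
# between the synthetic caption and the noisy web caption.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
sent_enc = SentenceTransformer("all-MiniLM-L6-v2")

def sieve_score(image: Image.Image, web_caption: str) -> float:
    """Higher score = synthetic and web captions agree = pair is likely well aligned."""
    inputs = blip_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        ids = blip.generate(**inputs, max_new_tokens=30)
    synthetic = blip_proc.decode(ids[0], skip_special_tokens=True)
    emb = sent_enc.encode([synthetic, web_caption], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Pruning then keeps only pairs above a threshold, e.g.:
# kept = [(img, cap) for img, cap in pairs if sieve_score(img, cap) > 0.5]
```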
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
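The concatenation step can be sketched as a simple data transform; the group size, shuffling, and field names below are illustrative assumptions rather than COSA's exact pipeline:
```python
# Minimal sketch of COSA-style sample concatenation: group k image-text pairs
# into one pseudo video-paragraph training example.
import random
from typing import Dict, List, Tuple

def make_pseudo_video_corpus(pairs: List[Tuple[str, str]], k: int = 4,
                             seed: int = 0) -> List[Dict]:
    """pairs: (image_path, caption) tuples; returns concatenated samples."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    samples = []
    for i in range(0, len(shuffled) - k + 1, k):
        chunk = shuffled[i:i + k]
        samples.append({
            "frames": [img for img, _ in chunk],             # images act as frames
            "paragraph": " ".join(cap for _, cap in chunk),  # captions act as a paragraph
        })
    return samples

corpus = make_pseudo_video_corpus([("img0.jpg", "a dog runs"),
                                   ("img1.jpg", "a cat sleeps"),
                                   ("img2.jpg", "two birds fly"),
                                   ("img3.jpg", "a boat sails")])
```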
- Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models [45.36305540697616]
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text.
The aligned image-text spaces learned by all popular VL models still suffer from the so-called 'object bias'.
arXiv Detail & Related papers (2023-05-31T06:36:41Z)
- VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
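A rough sketch of the quantization idea behind this approach, using a frozen BERT embedding table as the codebook; the image encoder, decoder, and reconstruction objective that LQAE trains around this step are omitted, and the random features are a stand-in:
```python
# Rough sketch of the core LQAE idea: quantize continuous image features to the
# nearest token embeddings of a frozen language model, so an image becomes a
# sequence of "text" tokens.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
codebook = bert.get_input_embeddings().weight.detach()   # (vocab_size, 768), frozen

def quantize_to_tokens(patch_features: torch.Tensor):
    """patch_features: (num_patches, 768) image features projected to BERT's width."""
    dists = torch.cdist(patch_features, codebook)         # distance to every token embedding
    ids = dists.argmin(dim=-1)                            # nearest-neighbour quantization
    return tokenizer.convert_ids_to_tokens(ids.tolist())

fake_patches = torch.randn(16, 768)                       # stand-in for an image encoder output
print(quantize_to_tokens(fake_patches))                   # the image rendered as BERT tokens
```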
- ComCLIP: Training-Free Compositional Image and Text Matching [19.373706257771673]
Contrastive Language-Image Pretraining has demonstrated great zero-shot performance for matching images and text.
We propose a novel training-free compositional CLIP model (ComCLIP).
ComCLIP disentangles input images into subjects, objects, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embedding and sub-image embeddings.
arXiv Detail & Related papers (2022-11-25T01:37:48Z)
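A simplified illustration of this matching scheme, assuming the subject/object/action sub-images have already been extracted by an upstream step (not shown) and using a plain average in place of the paper's actual composition mechanism:
```python
# Simplified illustration of ComCLIP-style matching: score a caption against the
# full image plus subject/object/action sub-images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def comclip_score(full_image: Image.Image, sub_images: list, caption: str) -> float:
    images = [full_image] + sub_images
    inputs = processor(text=[caption], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs.pixel_values)
        txt_emb = model.get_text_features(input_ids=inputs.input_ids,
                                          attention_mask=inputs.attention_mask)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = img_emb @ txt_emb.T                      # each (sub-)image vs. the caption
    score = sims[0, 0]                              # whole-image similarity
    if len(sub_images) > 0:
        score = 0.5 * (score + sims[1:, 0].mean())  # refine with sub-image evidence
    return score.item()
```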
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
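A hedged sketch of the reward computation: CLIP image-text cosine similarity scores candidate captions, which in the paper drives REINFORCE-style fine-tuning of the captioner (the training loop and the grammar finetuning of the text encoder are omitted):
```python
# Hedged sketch of the CLIP-reward signal: image-text cosine similarity scores
# candidate captions for a given image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, captions: list) -> torch.Tensor:
    """Returns one reward per candidate caption: cosine similarity to the image."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs.pixel_values)
        txt = model.get_text_features(input_ids=inputs.input_ids,
                                      attention_mask=inputs.attention_mask)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # higher = caption better grounded in the image

# A more distinctive caption should earn a larger reward, e.g.:
# clip_reward(img, ["a dog on a red couch", "a dog"])
```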