Text encoders bottleneck compositionality in contrastive vision-language models
- URL: http://arxiv.org/abs/2305.14897v2
- Date: Mon, 30 Oct 2023 17:57:47 GMT
- Title: Text encoders bottleneck compositionality in contrastive vision-language models
- Authors: Amita Kamath, Jack Hessel, Kai-Wei Chang
- Abstract summary: We train text-only recovery probes that aim to reconstruct captions from single-vector text representations.
We find that CLIP's text encoder falls short on more compositional inputs.
Results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors.
- Score: 76.2406963762722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performant vision-language (VL) models like CLIP represent captions using a
single vector. How much information about language is lost in this bottleneck?
We first curate CompPrompts, a set of increasingly compositional image captions
that VL models should be able to capture (e.g., single object, to
object+property, to multiple interacting objects). Then, we train text-only
recovery probes that aim to reconstruct captions from single-vector text
representations produced by several VL models. This approach does not require
images, allowing us to test on a broader range of scenes compared to prior
work. We find that: 1) CLIP's text encoder falls short on more compositional
inputs, including object relationships, attribute-object association, counting,
and negations; 2) some text encoders work significantly better than others; and
3) text-only recovery performance predicts multi-modal matching performance on
ControlledImCaps: a new evaluation benchmark we collect and release consisting
of fine-grained compositional images and captions. Specifically, our results
suggest text-only recoverability is a necessary (but not sufficient) condition
for modeling compositional factors in contrastive VL models. We release our
datasets and code.
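As a rough illustration of the probing setup (not the released code), the sketch below encodes captions into single vectors with a frozen CLIP text encoder and trains a small decoder to reconstruct the tokens from that vector. The GRU decoder, checkpoint choice, and training details are illustrative assumptions, not the paper's exact probe.
```python
# Sketch of a text-only recovery probe: encode a caption into CLIP's single
# text vector, then train a small decoder to reconstruct the caption tokens.
# The GRU decoder below is an illustrative choice, not the paper's exact probe.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()  # frozen

class RecoveryProbe(nn.Module):
    """Decodes caption tokens from a single frozen text embedding."""
    def __init__(self, embed_dim=512, hidden=512, vocab=tokenizer.vocab_size):
        super().__init__()
        self.token_emb = nn.Embedding(vocab, hidden)
        self.init_proj = nn.Linear(embed_dim, hidden)  # vector -> initial GRU state
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, text_vec, input_ids):
        h0 = torch.tanh(self.init_proj(text_vec)).unsqueeze(0)
        hidden_states, _ = self.gru(self.token_emb(input_ids), h0)
        return self.out(hidden_states)                 # logits over the vocabulary

captions = ["two dogs chasing one cat", "a red cube left of a blue sphere"]
batch = tokenizer(captions, padding=True, return_tensors="pt")
with torch.no_grad():
    text_vec = clip.get_text_features(**batch)         # (B, 512) bottleneck vectors

probe = RecoveryProbe()
logits = probe(text_vec, batch.input_ids[:, :-1])       # teacher forcing
loss = nn.functional.cross_entropy(                     # padding handling omitted
    logits.reshape(-1, logits.size(-1)), batch.input_ids[:, 1:].reshape(-1))
loss.backward()  # in practice: loop over CompPrompts with an optimizer
```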
Related papers
- ComAlign: Compositional Alignment in Vision-Language Models [2.3250871476216814]
We introduce Compositional Alignment (ComAlign) to discover more exact correspondence of text and image components.
Our methodology emphasizes that the compositional structure extracted from the text modality must also be retained in the image modality.
We train a lightweight network lying on top of existing visual and language encoders using a small dataset.
arXiv Detail & Related papers (2024-09-12T16:46:41Z)
- VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that this approach suffers from multiple limitations including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
arXiv Detail & Related papers (2023-10-03T14:53:53Z)
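A minimal sketch of this kind of pruning signal, assuming BLIP as the clean-data captioner and a sentence transformer for text similarity (both are illustrative stand-ins, not necessarily the paper's exact models or threshold):
```python
# Hedged sketch of a Sieve-style pruning signal: caption each image with a model
# trained on clean image-text data, then score the pair by the text similarity
# between the synthetic caption and the noisy web caption.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
sent_enc = SentenceTransformer("all-MiniLM-L6-v2")

def sieve_score(image: Image.Image, web_caption: str) -> float:
    """Higher score = synthetic and web captions agree = pair is likely well aligned."""
    inputs = blip_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        ids = blip.generate(**inputs, max_new_tokens=30)
    synthetic = blip_proc.decode(ids[0], skip_special_tokens=True)
    emb = sent_enc.encode([synthetic, web_caption], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Pruning then keeps only pairs above a threshold, e.g.:
# kept = [(img, cap) for img, cap in pairs if sieve_score(img, cap) > 0.5]
```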
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
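The concatenation step can be sketched as a simple data transform; the group size, shuffling, and field names below are illustrative assumptions rather than COSA's exact pipeline:
```python
# Minimal sketch of COSA-style sample concatenation: group k image-text pairs
# into one pseudo video-paragraph training example.
import random
from typing import Dict, List, Tuple

def make_pseudo_video_corpus(pairs: List[Tuple[str, str]], k: int = 4,
                             seed: int = 0) -> List[Dict]:
    """pairs: (image_path, caption) tuples; returns concatenated samples."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    samples = []
    for i in range(0, len(shuffled) - k + 1, k):
        chunk = shuffled[i:i + k]
        samples.append({
            "frames": [img for img, _ in chunk],             # images act as frames
            "paragraph": " ".join(cap for _, cap in chunk),  # captions act as a paragraph
        })
    return samples

corpus = make_pseudo_video_corpus([("img0.jpg", "a dog runs"),
                                   ("img1.jpg", "a cat sleeps"),
                                   ("img2.jpg", "two birds fly"),
                                   ("img3.jpg", "a boat sails")])
```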
- Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models [45.36305540697616]
Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text.
The aligned image-text spaces learned by all popular VL models still suffer from the so-called 'object bias'.
arXiv Detail & Related papers (2023-05-31T06:36:41Z)
- VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
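A rough sketch of the quantization idea behind this approach, using a frozen BERT embedding table as the codebook; the image encoder, decoder, and reconstruction objective that LQAE trains around this step are omitted, and the random features are a stand-in:
```python
# Rough sketch of the core LQAE idea: quantize continuous image features to the
# nearest token embeddings of a frozen language model, so an image becomes a
# sequence of "text" tokens.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
codebook = bert.get_input_embeddings().weight.detach()   # (vocab_size, 768), frozen

def quantize_to_tokens(patch_features: torch.Tensor):
    """patch_features: (num_patches, 768) image features projected to BERT's width."""
    dists = torch.cdist(patch_features, codebook)         # distance to every token embedding
    ids = dists.argmin(dim=-1)                            # nearest-neighbour quantization
    return tokenizer.convert_ids_to_tokens(ids.tolist())

fake_patches = torch.randn(16, 768)                       # stand-in for an image encoder output
print(quantize_to_tokens(fake_patches))                   # the image rendered as BERT tokens
```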
- ComCLIP: Training-Free Compositional Image and Text Matching [19.373706257771673]
Contrastive Language-Image Pretraining has demonstrated great zero-shot performance for matching images and text.
We propose a novel training-free compositional CLIP model (ComCLIP).
ComCLIP disentangles input images into subjects, objects, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embedding and sub-image embeddings.
arXiv Detail & Related papers (2022-11-25T01:37:48Z)
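A simplified illustration of this matching scheme, assuming the subject/object/action sub-images have already been extracted by an upstream step (not shown) and using a plain average in place of the paper's actual composition mechanism:
```python
# Simplified illustration of ComCLIP-style matching: score a caption against the
# full image plus subject/object/action sub-images.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def comclip_score(full_image: Image.Image, sub_images: list, caption: str) -> float:
    images = [full_image] + sub_images
    inputs = processor(text=[caption], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs.pixel_values)
        txt_emb = model.get_text_features(input_ids=inputs.input_ids,
                                          attention_mask=inputs.attention_mask)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    sims = img_emb @ txt_emb.T                      # each (sub-)image vs. the caption
    score = sims[0, 0]                              # whole-image similarity
    if len(sub_images) > 0:
        score = 0.5 * (score + sims[1:, 0].mean())  # refine with sub-image evidence
    return score.item()
```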
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
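A hedged sketch of the reward computation: CLIP image-text cosine similarity scores candidate captions, which in the paper drives REINFORCE-style fine-tuning of the captioner (the training loop and the grammar finetuning of the text encoder are omitted):
```python
# Hedged sketch of the CLIP-reward signal: image-text cosine similarity scores
# candidate captions for a given image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, captions: list) -> torch.Tensor:
    """Returns one reward per candidate caption: cosine similarity to the image."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs.pixel_values)
        txt = model.get_text_features(input_ids=inputs.input_ids,
                                      attention_mask=inputs.attention_mask)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # higher = caption better grounded in the image

# A more distinctive caption should earn a larger reward, e.g.:
# clip_reward(img, ["a dog on a red couch", "a dog"])
```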