Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL
Models
- URL: http://arxiv.org/abs/2305.19595v2
- Date: Thu, 1 Jun 2023 16:16:02 GMT
- Title: Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL
Models
- Authors: Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim,
Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio
Feris, Shimon Ullman, Leonid Karlinsky
- Abstract summary: Vision and Language (VL) models offer an effective method for aligning representation spaces of images and text.
The aligned image-text spaces learned by all the popular VL models still suffer from the so-called `object bias'.
- Score: 45.36305540697616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision and Language (VL) models offer an effective method for aligning
representation spaces of images and text, leading to numerous applications such
as cross-modal retrieval, visual question answering, captioning, and more.
However, the aligned image-text spaces learned by all the popular VL models
still suffer from the so-called `object bias': their representations behave as
`bags of nouns', mostly ignoring or downplaying the attributes, relations, and
states of the objects described in the texts or appearing in the images.
Although several notable attempts at fixing these `compositional reasoning'
issues have been proposed in the recent literature, the problem is still far
from solved. In this paper, we uncover two factors limiting the compositional
reasoning performance of VL models. Both are properties of the paired VL
dataset used for fine-tuning and pre-training the VL model: (i) the caption
quality, or in other words the `image-alignment', of the texts; and (ii) the
`density' of the captions, in the sense of mentioning all the details appearing
in the image. We propose a fine-tuning approach that automatically treats both
factors, leveraging a standard VL dataset (CC3M). Applied to CLIP, we
demonstrate a significant increase in compositional reasoning performance: up
to $\sim27\%$ over the base model, up to $\sim20\%$ over the strongest
baseline, and $6.7\%$ on average.
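The exact DAC pipeline is not spelled out in this summary, but the abstract points to two concrete levers: scoring how well a caption is aligned with its image (factor i) and fine-tuning on denser captions (factor ii). The following is a minimal sketch of that idea, assuming a Hugging Face `transformers` CLIP checkpoint and captions that have already been densified by some off-the-shelf captioner or LLM; the function names, threshold, and training loop are illustrative, not the paper's implementation.

```python
# Minimal sketch (assumptions noted above), not the paper's exact DAC pipeline:
# (i) score caption-image alignment with CLIP to flag poorly aligned captions,
# (ii) run a standard contrastive fine-tuning step on (image, dense caption) pairs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def alignment_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings
    (a proxy for factor (i), caption quality / image-alignment)."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())


def finetune_step(images, dense_captions, optimizer):
    """One contrastive fine-tuning step. `dense_captions` are assumed to
    already mention the attributes/relations/states in the image (factor (ii)),
    e.g. produced by an off-the-shelf captioner or an LLM caption expander."""
    inputs = processor(text=dense_captions, images=images, return_tensors="pt",
                       padding=True, truncation=True).to(device)
    out = model(**inputs, return_loss=True)  # symmetric image-text contrastive loss
    optimizer.zero_grad()
    out.loss.backward()
    optimizer.step()
    return out.loss.item()
```

In this sketch, captions whose `alignment_score` falls below a chosen threshold (a hyperparameter, not taken from the paper) could be dropped or regenerated before fine-tuning, and `finetune_step` would be called inside an ordinary DataLoader loop over CC3M with, e.g., `torch.optim.AdamW(model.parameters(), lr=1e-6)`.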
Related papers
- Removing Distributional Discrepancies in Captions Improves Image-Text Alignment [76.31530836622694]
We introduce a model designed to improve the prediction of image-text alignment.
Our approach focuses on generating high-quality training datasets for the alignment task.
We also demonstrate the applicability of our model by ranking the images generated by text-to-image models based on text alignment.
arXiv Detail & Related papers (2024-10-01T17:50:17Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA), which controls the visual attention maps using syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
Composed image retrieval (CIR) aims to retrieve images based on a query image together with text describing the user's intent.
Existing methods have made great progress using advanced large vision-language (VL) models on the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- 3VL: using Trees to teach Vision & Language models compositional concepts [45.718319397947056]
We introduce the Tree-augmented Vision-Language (3VL) model architecture and training technique.
We show how Anchor, a simple technique for text unification, can be employed to filter nuisance factors.
We also exhibit how DiRe, which performs a differential relevancy comparison between VLM maps, enables us to generate compelling visualizations of a model's success or failure.
arXiv Detail & Related papers (2023-12-28T20:26:03Z)
- Improving Compositional Text-to-image Generation with Large Vision-Language Models [26.202725136839632]
Compositional text-to-image models frequently encounter difficulties in generating high-quality images that align with input texts.
We employ large vision-language models (LVLMs) for multi-dimensional assessment of the alignment between generated images and their corresponding input texts.
Our experimental results validate that the proposed methodology significantly improves text-image alignment in compositional image generation.
arXiv Detail & Related papers (2023-10-10T05:09:05Z)
- Text encoders bottleneck compositionality in contrastive vision-language models [76.2406963762722]
We train text-only recovery probes that aim to reconstruct captions from single-vector text representations.
We find that CLIP's text encoder falls short on more compositional inputs.
Results suggest text-only recoverability is a necessary (but not sufficient) condition for modeling compositional factors.
arXiv Detail & Related papers (2023-05-24T08:48:44Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present a novel prompt-based scheme to train the UIC model, making the best use of its powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)