Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality
- URL: http://arxiv.org/abs/2305.13812v3
- Date: Tue, 24 Oct 2023 21:21:00 GMT
- Title: Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality
- Authors: Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan
Xiong, Jingfei Du, Yu Chen
- Abstract summary: Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
- Score: 50.48859793121308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastively trained vision-language models have achieved remarkable
progress in vision and language representation learning, leading to
state-of-the-art models for various downstream multimodal tasks. However,
recent research has highlighted severe limitations of these models in their
ability to perform compositional reasoning over objects, attributes, and
relations. Scene graphs have emerged as an effective way to understand images
compositionally. These are graph-structured semantic representations of images
that contain objects, their attributes, and relations with other objects in a
scene. In this work, we consider the scene graph parsed from text as a proxy
for the image scene graph and propose a graph decomposition and augmentation
framework along with a coarse-to-fine contrastive learning objective between
images and text that aligns sentences of various complexities to the same
image. Along with this, we propose novel negative mining techniques in the
scene graph space for improving attribute binding and relation understanding.
Through extensive experiments, we demonstrate the effectiveness of our approach
that significantly improves attribute binding, relation understanding,
systematic generalization, and productivity on multiple recently proposed
benchmarks (for example, improvements of up to $18\%$ for systematic
generalization and $16.5\%$ for relation understanding over a strong baseline),
while achieving similar or better performance than CLIP on various general
multimodal tasks.
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet reached a conclusive answer.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z) - Matching Visual Features to Hierarchical Semantic Topics for Image
Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z) - Exploring Explicit and Implicit Visual Relationships for Image
Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit a gated graph convolutional network (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture that better explores the semantics available in captions and leverages them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)