Related papers: Compositional Entailment Learning for Hyperbolic Vision-Language Models

Compositional Entailment Learning for Hyperbolic Vision-Language Models

URL: http://arxiv.org/abs/2410.06912v1
Date: Wed, 9 Oct 2024 14:12:50 GMT
Title: Compositional Entailment Learning for Hyperbolic Vision-Language Models
Authors: Avik Pal, Max van Spengler, Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, Pascal Mettes,
Abstract summary: We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
Score: 54.41927525264365
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space. Since visual and textual concepts are naturally hierarchical, recent work has shown that hyperbolic space can serve as a high-potential manifold to learn vision-language representation with strong downstream performance. In this work, for the first time we show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. The idea is that an image is not only described by a sentence but is itself a composition of multiple object boxes, each with their own textual description. Such information can be obtained freely by extracting nouns from sentences and using openly available localized grounding models. We show how to hierarchically organize images, image boxes, and their textual descriptions through contrastive and entailment-based objectives. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning, as well as recent hyperbolic alternatives, with better zero-shot and retrieval generalization and clearly stronger hierarchical performance.

Related papers

SmartCLIP: Modular Vision-language Alignment with Identification Guarantees [59.16312652369709]
Contrastive Language-Image Pre-training (CLIP)citepradford2021learning has emerged as a pivotal model in computer vision and multimodal learning.<n>CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representation.<n>We introduce ours, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner.
arXiv Detail & Related papers (2025-07-29T22:26:20Z)
TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications. We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning [64.1316997189396]
We present a novel language-tied self-supervised learning framework, Hierarchical Language-tied Self-Supervision (HLSS) for histopathology images. Our resulting model achieves state-of-the-art performance on two medical imaging benchmarks, OpenSRH and TCGA datasets.
arXiv Detail & Related papers (2024-03-21T17:58:56Z)
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning. Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [78.07495777674747]
We argue that by using visual clues to bridge large pretrained vision foundation models and language models, we can do so without any extra cross-modal training. Thanks to the strong zero-shot capability of foundation models, we start by constructing a rich semantic representation of the image. We use large language model to produce a series of comprehensive descriptions for the visual content, which is then verified by the vision model again to select the candidate that aligns best with the image.
arXiv Detail & Related papers (2022-06-03T22:33:09Z)
Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework. To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text. We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z)
Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.