Improving fine-grained understanding in image-text pre-training
- URL: http://arxiv.org/abs/2401.09865v1
- Date: Thu, 18 Jan 2024 10:28:45 GMT
- Title: Improving fine-grained understanding in image-text pre-training
- Authors: Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko
  Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer,
  Charles Blundell, Razvan Pascanu, Jovana Mitrović
- Abstract summary: We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on both image-level tasks relying on coarse-grained information and region-level tasks relying on fine-grained information.
- Score: 37.163228122323865
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple
method for pretraining more fine-grained multimodal representations from
image-text pairs. Given that multiple image patches often correspond to single
words, we propose to learn a grouping of image patches for every token in the
caption. To achieve this, we use a sparse similarity metric between image
patches and language tokens and compute for each token a language-grouped
vision embedding as the weighted average of patches. The token and
language-grouped vision embeddings are then contrasted through a fine-grained
sequence-wise loss that only depends on individual samples and does not require
other batch samples as negatives. This enables more detailed information to be
learned in a computationally inexpensive manner. SPARC combines this
fine-grained loss with a contrastive loss between global image and text
embeddings to learn representations that simultaneously encode global and local
information. We thoroughly evaluate our proposed method and show improved
performance over competing approaches both on image-level tasks relying on
coarse-grained information, e.g. classification, and on region-level tasks
relying on fine-grained information, e.g. retrieval, object detection, and
segmentation. Moreover, SPARC improves model faithfulness and captioning in
foundational vision-language models.
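
To make the pipeline concrete, the sketch below shows one way the abstract's fine-grained alignment could be implemented. It is a minimal PyTorch sketch, not the authors' code: the softmax-then-threshold sparsification rule, the temperature values, the loss weight lam, and the assumption that every caption token is valid (no padding mask) are simplifications introduced here.

import torch
import torch.nn.functional as F

def sparc_fine_grained_loss(patches, tokens, temperature=0.1):
    # patches: (B, P, D) patch embeddings; tokens: (B, T, D) token embeddings.
    patches = F.normalize(patches, dim=-1)
    tokens = F.normalize(tokens, dim=-1)

    # Token-to-patch similarities, turned into per-token weights over patches.
    sim = torch.einsum("btd,bpd->btp", tokens, patches)        # (B, T, P)
    weights = sim.softmax(dim=-1)

    # Sparsify: drop weights at or below the uniform level 1/P, renormalise.
    P = patches.shape[1]
    weights = torch.where(weights > 1.0 / P, weights, torch.zeros_like(weights))
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-6)

    # Language-grouped vision embedding: weighted average of patches per token.
    grouped = F.normalize(torch.einsum("btp,bpd->btd", weights, patches), dim=-1)

    # Sequence-wise InfoNCE: token t's positive is its own grouped embedding;
    # the other T-1 positions in the same sample act as negatives, so no
    # other batch samples are required.
    logits = torch.einsum("btd,bsd->bts", tokens, grouped) / temperature  # (B, T, T)
    B, T = logits.shape[:2]
    labels = torch.arange(T, device=logits.device).repeat(B)              # (B*T,)
    loss_t2v = F.cross_entropy(logits.reshape(B * T, T), labels)
    loss_v2t = F.cross_entropy(logits.transpose(1, 2).reshape(B * T, T), labels)
    return 0.5 * (loss_t2v + loss_v2t)

def sparc_total_loss(global_img, global_txt, patches, tokens,
                     lam=1.0, temperature=0.1):
    # Global CLIP-style contrastive loss over the batch, combined with the
    # fine-grained term above.
    global_img = F.normalize(global_img, dim=-1)   # (B, D)
    global_txt = F.normalize(global_txt, dim=-1)   # (B, D)
    logits = global_img @ global_txt.t() / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    global_loss = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    return global_loss + lam * sparc_fine_grained_loss(patches, tokens)

Note that the fine-grained term draws negatives only from positions within each sequence, which is what keeps its cost independent of batch size, matching the abstract's claim of a computationally inexpensive loss.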
Related papers
- ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based frameworks have become the mainstream approach for processing whole slide images (WSIs).
We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - DiffCLIP: Few-shot Language-driven Multimodal Classifier [19.145645804307566]
DiffCLIP is a novel framework that extends Contrastive Language-Image Pretraining (CLIP).
It conveys comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images.
DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP.
arXiv Detail & Related papers (2024-12-10T02:21:39Z) - FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z) - Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach [4.9204263448542465]
This study introduces an innovative, fine-grained dimension by integrating patch-level discrimination into self-supervised visual representation learning.
We employ a distinctive photometric patch-level augmentation, where each patch is individually augmented, independent from other patches within the same view.
We present a simple yet effective patch-matching algorithm to find the corresponding patches across the augmented views.
arXiv Detail & Related papers (2023-10-28T09:35:30Z) - Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z) - ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a generic sketch of this objective appears after this list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)