Related papers: TIPS: Text-Image Pretraining with Spatial Awareness

TIPS: Text-Image Pretraining with Spatial Awareness

URL: http://arxiv.org/abs/2410.16512v1
Date: Mon, 21 Oct 2024 21:05:04 GMT
Title: TIPS: Text-Image Pretraining with Spatial Awareness
Authors: Kevis-Kokitsi Maninis, Kaifeng Chen, Soham Ghosh, Arjun Karpur, Koert Chen, Ye Xia, Bingyi Cao, Daniel Salz, Guangxing Han, Jan Dlabal, Dan Gnanapragasam, Mojtaba Seyedhosseini, Howard Zhou, Andre Araujo,
Abstract summary: Self-supervised image-only pretraining is still the go-to method for many vision applications. We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
Score: 13.38247732379754
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While image-text representation learning has become very popular in recent years, existing models tend to lack spatial awareness and have limited direct applicability for dense understanding tasks. For this reason, self-supervised image-only pretraining is still the go-to method for many dense vision applications (e.g. depth estimation, semantic segmentation), despite the lack of explicit supervisory signals. In this paper, we close this gap between image-text and self-supervised learning, by proposing a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks. Our method, which we refer to as Text-Image Pretraining with Spatial awareness (TIPS), leverages two simple and effective insights. First, on textual supervision: we reveal that replacing noisy web image captions by synthetically generated textual descriptions boosts dense understanding performance significantly, due to a much richer signal for learning spatially aware representations. We propose an adapted training method that combines noisy and synthetic captions, resulting in improvements across both dense and global understanding tasks. Second, on the learning technique: we propose to combine contrastive image-text learning with self-supervised masked image modeling, to encourage spatial coherence, unlocking substantial enhancements for downstream applications. Building on these two ideas, we scale our model using the transformer architecture, trained on a curated set of public images. Our experiments are conducted on 8 tasks involving 16 datasets in total, demonstrating strong off-the-shelf performance on both dense and global understanding, for several image-only and image-text tasks.

Related papers

Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation [70.95783968368124]
We introduce a novel multi-modal autoregressive model, dubbed $textbfInstaManip$. We propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages. Our method surpasses previous few-shot image manipulation models by a notable margin.
arXiv Detail & Related papers (2024-12-02T01:19:21Z)
Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning. Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework. We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image. We identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks. For each sentence, we first retrieve a flexible number of images either from a light topic-image lookup table extracted over the existing sentence-image pairs. Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts. We introduce LO, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning [46.914284894632]
We propose an end-to-end unified-modal pre-training framework, namely UNIMO-2. We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts. Our code and models are public at the UNIMO project page.
arXiv Detail & Related papers (2022-03-17T03:53:11Z)
Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces. In our approach, we first obtain dense vector representations of images using StackGAN-based autoencoder model and also dense vector representations on sentence-level utilizing LSTM based text-autoencoder.
arXiv Detail & Related papers (2021-12-09T13:54:56Z)
FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism. We construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to representations BEfore Fusing (ALBEF) through cross-modal attention. We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images. The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training. Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
arXiv Detail & Related papers (2019-01-26T10:15:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.