STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
- URL: http://arxiv.org/abs/2301.13081v1
- Date: Mon, 30 Jan 2023 17:21:30 GMT
- Title: STAIR: Learning Sparse Text and Image Representation in Grounded Tokens
- Authors: Chen Chen, Bowen Zhang, Liangliang Cao, Jiguang Shen, Tom Gunter,
Albin Madappally Jose, Alexander Toshev, Jonathon Shlens, Ruoming Pang,
Yinfei Yang
- Abstract summary: We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
STAIR significantly outperforms a CLIP model, with +$4.9\%$ and +$4.3\%$ absolute Recall@1 improvements.
- Score: 84.14528645941128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image and text retrieval is one of the foundational tasks in the vision and
language domain with multiple real-world applications. State-of-the-art
approaches, e.g. CLIP, ALIGN, represent images and texts as dense embeddings
and calculate the similarity in the dense embedding space as the matching
score. On the other hand, sparse semantic features like bag-of-words models are
more interpretable, but are believed to suffer from inferior accuracy compared
to dense representations. In this work, we show that it is possible to build a
sparse semantic representation that is as powerful as, or even better than,
dense representations. We extend the CLIP model and build a sparse text and image
representation (STAIR), where the image and text are mapped to a sparse token
space. Each token in the space is a (sub-)word in the vocabulary, which is not
only interpretable but also easy to integrate with existing information
retrieval systems. The STAIR model significantly outperforms a CLIP model, with
+$4.9\%$ and +$4.3\%$ absolute Recall@1 improvements on COCO-5k
text$\rightarrow$image and image$\rightarrow$text retrieval, respectively. It
also achieves better performance than CLIP on both ImageNet zero-shot
classification and linear probing.
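To make the dense-versus-sparse distinction concrete, here is a minimal sketch (not the authors' implementation) of how a retrieval score could be computed in each setting: CLIP-style cosine similarity between dense embeddings versus a STAIR-style dot product over vocabulary-sized sparse activations, where each nonzero dimension corresponds to a (sub-)word token. All dimensions, token indices, and values below are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only; dimensions and activations are hypothetical stand-ins.
EMBED_DIM = 512      # dense embedding size (CLIP-style)
VOCAB_SIZE = 30522   # sparse vocabulary size (STAIR-style), e.g. a subword vocab

def dense_score(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """CLIP-style matching: cosine similarity between dense embeddings."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return float(img @ txt)

def sparse_score(img_sparse: np.ndarray, txt_sparse: np.ndarray) -> float:
    """STAIR-style matching: dot product of mostly-zero, vocabulary-sized
    activations; each nonzero index is a (sub-)word token, so the score
    decomposes into interpretable per-token contributions."""
    return float(img_sparse @ txt_sparse)

# Toy example with random dense embeddings and hand-set sparse activations.
rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=EMBED_DIM), rng.normal(size=EMBED_DIM)

img_sparse = np.zeros(VOCAB_SIZE)
txt_sparse = np.zeros(VOCAB_SIZE)
img_sparse[[101, 2054, 7592]] = [1.3, 0.7, 0.2]   # tokens "fired" by the image encoder
txt_sparse[[101, 7592, 9999]] = [0.9, 0.5, 0.4]   # tokens "fired" by the text encoder

print(dense_score(img_emb, txt_emb))
print(sparse_score(img_sparse, txt_sparse))       # only overlapping tokens contribute
```

Because only overlapping token indices contribute to the sparse score, such a representation can plug into inverted-index information retrieval systems directly, which is the integration advantage the abstract highlights.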
Related papers
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings satisfy more geometric properties in embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- Exploring Simple Open-Vocabulary Semantic Segmentation [7.245983878396646]
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts.
In this paper, we introduce S-Seg, a novel model that can achieve surprisingly strong performance without depending on any of the above elements.
arXiv Detail & Related papers (2024-01-22T18:59:29Z)
- Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on image-level tasks relying on coarse-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
- Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness [19.77762574325687]
The CLIP (Contrastive Language-Image Pre-training) model and its variants are becoming the de facto backbone in many applications.
We discuss two effective approaches to improve the efficiency and robustness of CLIP training.
Our filter-based CLIP model demonstrates a top-1 accuracy of 68.78%, outperforming previous models, whose accuracies were all below 50%.
arXiv Detail & Related papers (2023-05-08T23:47:07Z)
- LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation [10.623430999818925]
We present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets.
We show up to 6% improvement over standard baselines for text-to-image segmentation on natural images.
For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques.
arXiv Detail & Related papers (2023-03-22T06:55:01Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a minimal sketch of this setup appears after this list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy by expanding the views generated by a single image to cross-samples and multi-level representations.
Our method, termed CsMl, can integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
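As referenced in the entry on noisy text supervision above, CLIP, ALIGN, and STAIR all rely on dual-encoder training with a contrastive objective. The following is a minimal sketch of a symmetric InfoNCE-style contrastive loss over a batch of paired image and text embeddings; it is an illustrative reconstruction under standard assumptions, not code from any of the papers, and the temperature, batch size, and embedding dimension are placeholders.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss for a batch of paired embeddings.

    image_emb, text_emb: [batch, dim] outputs of the two encoders.
    Matching pairs share the same row index; all other rows in the
    batch act as negatives.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature    # [batch, batch] similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random tensors standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(contrastive_loss(img, txt).item())
```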