LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale
Image-Text Retrieval
- URL: http://arxiv.org/abs/2302.02908v1
- Date: Mon, 6 Feb 2023 16:24:41 GMT
- Title: LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale
Image-Text Retrieval
- Authors: Ziyang Luo, Pu Zhao, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Jing
Ma, Qingwen Lin, Daxin Jiang
- Abstract summary: The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders.
We propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts.
We introduce a novel pre-training framework that learns importance-aware lexicon representations.
Our framework achieves 5.5 ~ 221.3X faster retrieval speed and 13.2 ~ 48.8X less index storage memory.
- Score: 71.01982683581572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text retrieval (ITR) is a task to retrieve the relevant images/texts,
given the query from another modality. The conventional dense retrieval
paradigm relies on encoding images and texts into dense representations using
dual-stream encoders; however, it faces challenges with low retrieval speed in
large-scale retrieval scenarios. In this work, we propose the lexicon-weighting
paradigm, where sparse representations in vocabulary space are learned for
images and texts to take advantage of the bag-of-words models and efficient
inverted indexes, resulting in significantly reduced retrieval latency. A
crucial gap arises between the continuous nature of image data and the
requirement for a sparse vocabulary space representation. To bridge this gap,
we introduce a novel pre-training framework, Lexicon-Bottlenecked
Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon
representations. This framework features lexicon-bottlenecked modules between
the dual-stream encoders and weakened text decoders, allowing for constructing
continuous bag-of-words bottlenecks to learn lexicon-importance distributions.
Upon pre-training with the same-scale data, our LexLIP achieves state-of-the-art
performance on two benchmark ITR datasets, MSCOCO and Flickr30k. Furthermore,
in large-scale retrieval scenarios, LexLIP outperforms CLIP with a 5.5 ~ 221.3X
faster retrieval speed and 13.2 ~ 48.8X less index storage memory.
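To make the efficiency argument concrete, the sketch below (not the authors' code) illustrates the lexicon-weighting idea in a minimal form: each image or text is encoded as a sparse {token: weight} map over the vocabulary, so retrieval reduces to accumulating scores over postings in an inverted index instead of computing dense similarities against every candidate. All document names, tokens, and weights are hypothetical placeholders.
```python
# Minimal sketch of lexicon-weighted sparse retrieval with an inverted index.
# The sparse {token: weight} maps stand in for encoder outputs; values are made up.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: {token: weight}} -> {token: [(doc_id, weight), ...]}"""
    index = defaultdict(list)
    for doc_id, term_weights in docs.items():
        for token, weight in term_weights.items():
            index[token].append((doc_id, weight))
    return index

def search(index, query, top_k=3):
    """Score documents by the dot product of sparse lexicon weights,
    touching only the postings of the query's non-zero tokens."""
    scores = defaultdict(float)
    for token, q_weight in query.items():
        for doc_id, d_weight in index.get(token, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

if __name__ == "__main__":
    # Hypothetical sparse representations produced by a text encoder.
    corpus = {
        "caption_1": {"dog": 1.8, "grass": 0.9, "running": 1.2},
        "caption_2": {"cat": 1.5, "sofa": 1.1},
        "caption_3": {"dog": 0.7, "beach": 1.4},
    }
    index = build_inverted_index(corpus)
    # Hypothetical sparse representation produced by an image encoder for a query image.
    image_query = {"dog": 1.6, "running": 0.8}
    print(search(index, image_query))  # e.g. [('caption_1', ~3.84), ('caption_3', ~1.12)]
```
Because only the postings of the query's non-zero tokens are visited, latency scales with representation sparsity rather than corpus size times embedding dimension, which is consistent with the efficiency the abstract attributes to bag-of-words models and inverted indexes.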
Related papers
- Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language-Image Pre-training (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
arXiv Detail & Related papers (2024-09-03T14:33:01Z)
- Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models [2.3301643766310374]
By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data.
We show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods.
We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
arXiv Detail & Related papers (2024-08-29T06:54:03Z)
- Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- Leaner and Faster: Two-Stage Model Compression for Lightweight Text-Image Retrieval [18.088550230146247]
Current text-image retrieval approaches (e.g., CLIP) typically adopt a dual-encoder architecture using pre-trained vision-language representations.
We present an effective two-stage framework to compress large pre-trained dual-encoder for lightweight text-image retrieval.
arXiv Detail & Related papers (2022-04-29T07:29:06Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)