Leaner and Faster: Two-Stage Model Compression for Lightweight
Text-Image Retrieval
- URL: http://arxiv.org/abs/2204.13913v1
- Date: Fri, 29 Apr 2022 07:29:06 GMT
- Title: Leaner and Faster: Two-Stage Model Compression for Lightweight
Text-Image Retrieval
- Authors: Siyu Ren, Kenny Q. Zhu
- Abstract summary: Current text-image approaches (e.g., CLIP) typically adopt dual-encoder architecture using pre-trained vision-language representation.
We present an effective two-stage framework to compress large pre-trained dual-encoder for lightweight text-image retrieval.
- Score: 18.088550230146247
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Current text-image approaches (e.g., CLIP) typically adopt dual-encoder
architecture using pre-trained vision-language representation. However, these
models still pose non-trivial memory requirements and substantial incremental
indexing time, which makes them less practical on mobile devices. In this
paper, we present an effective two-stage framework to compress large
pre-trained dual-encoder for lightweight text-image retrieval. The resulting
model is smaller (39% of the original), faster (1.6x/2.9x for processing
image/text respectively), yet performs on par with or better than the
original full model on Flickr30K and MSCOCO benchmarks. We also open-source an
accompanying realistic mobile image search application.
Related papers
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval.
CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z)
- LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval [71.01982683581572]
The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders.
We propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts.
We introduce a novel pre-training framework, that learns importance-aware lexicon representations.
Our framework achieves a 5.5x to 221.3x faster retrieval speed and 13.2x to 48.8x less index storage memory.
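The lexicon-weighting idea above can be sketched in a few lines. This is a hedged toy under simplifying assumptions: random logits stand in for a learned model's term-importance predictions, the vocabulary size and `k` are arbitrary, and the helper names are invented for the example.

```python
import numpy as np

VOCAB = 30522  # illustrative BERT-sized vocabulary
rng = np.random.default_rng(1)

def lexicon_weights(dense_logits, k=64):
    """Keep only the top-k vocabulary terms, zeroing the rest (sparsification)."""
    w = np.maximum(dense_logits, 0.0)      # non-negative term importances
    idx = np.argsort(-w)[:k]
    sparse = np.zeros_like(w)
    sparse[idx] = w[idx]
    return sparse

# Toy "importance logits" for one image and one caption over the vocabulary.
img_vec = lexicon_weights(rng.normal(size=VOCAB))
txt_vec = lexicon_weights(rng.normal(size=VOCAB))

# Relevance is a sparse dot product; an inverted index can evaluate this
# without touching most vocabulary entries, which is where the reported
# speed and index-storage savings come from.
score = float(img_vec @ txt_vec)
```

In practice the nonzero weights would come from a pre-trained language-image model rather than random draws, and the index would store only the surviving (term, weight) pairs per image.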
arXiv Detail & Related papers (2023-02-06T16:24:41Z)
- Efficient Image Captioning for Edge Devices [8.724184244203892]
We propose LightCap, a lightweight image captioner for resource-limited devices.
The core design is built on the recent CLIP model for efficient image captioning.
With the carefully designed architecture, our model merely contains 40M parameters, saving the model size by more than 75% and the FLOPs by more than 98%.
arXiv Detail & Related papers (2022-12-18T01:56:33Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr)
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval [85.28292877465353]
This paper proposes a Hierarchical Vision-Language Pre-Training framework for fast Image-Text Retrieval (ITR)
Specifically, we design a novel hierarchical retrieval objective, which uses the representation of different dimensions for coarse-to-fine ITR.
arXiv Detail & Related papers (2022-05-24T14:32:57Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
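The trade-off described in the last entry, fast dual encoders versus accurate but slow cross-attention, is commonly resolved by a two-stage retrieve-then-rerank pipeline. The sketch below is illustrative only: random vectors stand in for embeddings, and the `cross_attention_score` function is a hypothetical placeholder for an expensive joint vision-text transformer.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 5000, 256

# Stand-in unit-norm embeddings for a pre-indexed gallery and one query.
gallery = rng.normal(size=(N, D)).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query = rng.normal(size=(D,)).astype(np.float32)
query /= np.linalg.norm(query)

def cross_attention_score(query_vec, item_vec):
    """Placeholder for an expensive joint text-vision scorer; here it just
    reuses the dot product so the example stays self-contained."""
    return float(query_vec @ item_vec)

# Stage 1 (fast): dual-encoder dot products over the whole gallery.
coarse = gallery @ query
candidates = np.argsort(-coarse)[:50]

# Stage 2 (slow): re-rank only the shortlist with the expensive model.
reranked = sorted(candidates, key=lambda i: -cross_attention_score(query, gallery[i]))
best = reranked[0]
```

The expensive scorer thus runs on 50 items instead of 5,000, which is how such systems keep cross-attention accuracy at near-dual-encoder latency.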
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.