HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text
Retrieval
- URL: http://arxiv.org/abs/2205.12105v1
- Date: Tue, 24 May 2022 14:32:57 GMT
- Title: HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text
Retrieval
- Authors: Feilong Chen and Xiuyi Chen and Jiaxin Shi and Duzhen Zhang and
Jianlong Chang and Qi Tian
- Abstract summary: This paper proposes Hierarchical Vision-Language Pre-Training (HiVLP) for fast Image-Text Retrieval (ITR).
Specifically, we design a novel hierarchical retrieval objective, which uses representations of different dimensions for coarse-to-fine ITR.
- Score: 85.28292877465353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the past few years, the emergence of vision-language pre-training (VLP)
has brought cross-modal retrieval to a new era. However, due to the latency and
computation demand, it is commonly challenging to apply VLP in a real-time
online retrieval system. To alleviate the defect, this paper proposes a
\textbf{Hi}erarchical \textbf{V}ision-\textbf{L}anguage \textbf{P}re-Training
(\textbf{HiVLP}) for fast Image-Text Retrieval (ITR). Specifically, we design a
novel hierarchical retrieval objective, which uses the representation of
different dimensions for coarse-to-fine ITR, i.e., using low-dimensional
representation for large-scale coarse retrieval and high-dimensional
representation for small-scale fine retrieval. We evaluate our proposed HiVLP
on two popular image-text retrieval benchmarks, i.e., Flickr30k and COCO.
Extensive experiments demonstrate that our HiVLP not only has fast inference
speed but also can be easily scaled to large-scale ITR scenarios. The detailed
results show that HiVLP is $1,427\sim120,649\times$ faster than the
fusion-based model UNITER and $2\sim5\times$ faster than the fastest embedding-based
model LightningDOT in different candidate scenarios. It also achieves about +4.9
AR on COCO and +3.8 AR on Flickr30K over LightningDOT and achieves comparable
performance with the state-of-the-art (SOTA) fusion-based model METER.
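As a rough illustration of the coarse-to-fine idea described in the abstract, the sketch below scores the whole gallery with cheap low-dimensional embeddings and then re-ranks only a shortlist with high-dimensional ones. It is a minimal sketch under assumed names and sizes (coarse_to_fine_retrieval, gallery_low, gallery_high, top_k, d_low, d_high are all hypothetical), not HiVLP's actual hierarchical retrieval objective.

```python
import numpy as np

def coarse_to_fine_retrieval(query_low, query_high, gallery_low, gallery_high, top_k=100):
    # Stage 1: coarse retrieval over the full gallery with low-dim embeddings.
    # Scores are plain dot products (cosine similarity if embeddings are L2-normalized).
    coarse_scores = gallery_low @ query_low           # shape (N,)
    candidates = np.argsort(-coarse_scores)[:top_k]   # shortlist the top-k candidates

    # Stage 2: fine retrieval, re-ranking only the shortlist with high-dim embeddings.
    fine_scores = gallery_high[candidates] @ query_high
    order = np.argsort(-fine_scores)
    return candidates[order], fine_scores[order]

# Usage with random placeholder embeddings (sizes are illustrative only).
rng = np.random.default_rng(0)
N, d_low, d_high = 10_000, 64, 768
gallery_low = rng.standard_normal((N, d_low)).astype(np.float32)
gallery_high = rng.standard_normal((N, d_high)).astype(np.float32)
query_low = rng.standard_normal(d_low).astype(np.float32)
query_high = rng.standard_normal(d_high).astype(np.float32)
ranked_ids, scores = coarse_to_fine_retrieval(query_low, query_high, gallery_low, gallery_high)
```

The appeal of this pattern is that the expensive high-dimensional pass touches only top_k candidates instead of all N gallery items, which is what makes coarse-to-fine retrieval attractive for large candidate pools.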
Related papers
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors (a minimal sketch of such a dense-to-sparse transformation appears after this list).
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework, designed for fast and effective long-text to image retrieval.
CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000, while reducing training and retrieval times by 68.75% and 99.79%, respectively.
arXiv Detail & Related papers (2024-02-23T11:47:16Z)
- Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection [66.72992463712299]
Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training models.
Previous research has demonstrated the efficacy of ViTs, but they still struggle with computational inefficiencies caused by lengthy visual sequences.
We introduce TRIPS, which reduces the visual sequence using a text-guided patch-selection layer in the visual backbone.
Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
arXiv Detail & Related papers (2024-01-11T14:31:30Z)
- PaLI-3 Vision Language Models: Smaller, Faster, Stronger [82.6453282241224]
PaLI-3 is a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger.
We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks.
arXiv Detail & Related papers (2023-10-13T15:45:19Z)
- RLIPv2: Fast Scaling of Relational Language-Image Pre-training [53.21796397618875]
We propose RLIPv2, a fast converging model that enables the relational scaling of pre-training to large-scale pseudo-labelled scene graph data.
Asymmetric Language-Image Fusion (ALIF) facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding.
RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings.
arXiv Detail & Related papers (2023-08-18T07:17:09Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval consists of finding images related to a given query text, or vice versa.
Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to fill in the gap between effectiveness and efficiency.
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
- Leaner and Faster: Two-Stage Model Compression for Lightweight Text-Image Retrieval [18.088550230146247]
Current text-image approaches (e.g., CLIP) typically adopt a dual-encoder architecture using pre-trained vision-language representations.
We present an effective two-stage framework to compress large pre-trained dual-encoder for lightweight text-image retrieval.
arXiv Detail & Related papers (2022-04-29T07:29:06Z)
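Referring back to the Multimodal Learned Sparse Retrieval entry above, the following is a minimal, hypothetical sketch of turning a dense embedding into a sparse lexical vector: project onto a vocabulary-sized space, keep non-negative log-saturated weights, and prune to the top-k terms. This is a generic SPLADE-style recipe sketched under assumptions; the names (dense_to_sparse_lexical, projection, top_k) and the paper's exact transformation may differ.

```python
import numpy as np

def dense_to_sparse_lexical(dense_vec, projection, top_k=64):
    # Project the dense embedding onto a vocabulary-sized space: one logit per term.
    vocab_logits = projection @ dense_vec             # shape (V,)
    # Non-negative, log-saturated term weights (a common choice in learned sparse retrieval).
    weights = np.log1p(np.maximum(vocab_logits, 0.0))
    # Sparsify by retaining only the top-k strongest terms.
    keep = np.argsort(-weights)[:top_k]
    sparse = np.zeros_like(weights)
    sparse[keep] = weights[keep]
    return sparse                                     # mostly zeros; compatible with an inverted index

# Usage with placeholder sizes (dense dim 768, vocabulary of 30,000 terms).
rng = np.random.default_rng(0)
dense_vec = rng.standard_normal(768).astype(np.float32)
projection = rng.standard_normal((30_000, 768)).astype(np.float32)
sparse_vec = dense_to_sparse_lexical(dense_vec, projection)
```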