Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening
- URL: http://arxiv.org/abs/2303.07740v1
- Date: Tue, 14 Mar 2023 09:36:42 GMT
- Title: Efficient Image-Text Retrieval via Keyword-Guided Pre-Screening
- Authors: Min Cao, Yang Bai, Jingyao Wang, Ziqiang Cao, Liqiang Nie, Min Zhang
- Abstract summary: Current image-text retrieval methods suffer from $N$-related time complexity.
This paper presents a simple and effective keyword-guided pre-screening framework for image-text retrieval.
- Score: 53.1711708318581
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid gains in accuracy, current image-text retrieval
methods suffer from a time complexity that grows with the number of gallery
samples $N$, which hinders their application in practice. To improve
efficiency, this paper presents a simple and effective keyword-guided
pre-screening framework for image-text retrieval. Specifically, we convert the
image and text data into keywords and perform keyword matching across
modalities to exclude a large number of irrelevant gallery samples before they
reach the retrieval network. For keyword prediction, we cast the task as a
multi-label classification problem and propose a multi-task learning scheme
that appends multi-label classifiers to the image-text retrieval network,
yielding lightweight and high-performance keyword prediction. For keyword
matching, we adopt the inverted index used in search engines, which benefits
both the time and space complexity of the pre-screening. Extensive experiments
on two widely used datasets, Flickr30K and MS-COCO, verify the effectiveness of
the proposed framework. Equipped with only two embedding layers, the framework
achieves $O(1)$ querying time complexity and, when applied before common
image-text retrieval methods, improves retrieval efficiency while preserving
their performance. Our code will be released.
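The keyword-matching step described in the abstract can be illustrated with a minimal sketch of an inverted index. This is not the authors' implementation: the function names `build_inverted_index` and `pre_screen`, the hand-written keyword sets (standing in for the outputs of the multi-label classifiers), and the matching rule (keep a gallery item if it shares at least one keyword with the query) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(gallery_keywords):
    """Map each keyword to the set of gallery ids whose keywords contain it."""
    index = defaultdict(set)
    for item_id, keywords in gallery_keywords.items():
        for kw in keywords:
            index[kw].add(item_id)
    return index


def pre_screen(index, query_keywords):
    """Return gallery ids sharing at least one keyword with the query.

    Assumed matching rule for this sketch; the paper may use a different one.
    """
    candidates = set()
    for kw in query_keywords:
        # .get avoids creating empty entries for unseen keywords.
        candidates |= index.get(kw, set())
    return candidates


# Hypothetical predicted keywords for four gallery images.
gallery = {
    "img_0": {"dog", "grass", "ball"},
    "img_1": {"man", "bicycle", "street"},
    "img_2": {"dog", "beach"},
    "img_3": {"pizza", "table"},
}

index = build_inverted_index(gallery)

# Hypothetical keywords predicted for a text query.
candidates = pre_screen(index, {"dog", "frisbee"})
print(sorted(candidates))  # ['img_0', 'img_2']
```

Only the surviving candidates would then be passed to the (much more expensive) retrieval network. In this sketch each keyword lookup is a hash-table access whose cost does not depend on the gallery size, although merging the matched sets still scales with the number of candidates returned.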
Related papers
- Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language and Image Pairing (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
arXiv Detail & Related papers (2024-09-03T14:33:01Z)
- Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models [2.3301643766310374]
By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data.
We show the superior precision and recall performance of our image retrieval method compared to conventional vision-language model-based methods.
We also demonstrate that the retrieval performance can be improved by iteratively incorporating keywords into search queries.
arXiv Detail & Related papers (2024-08-29T06:54:03Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval [71.01982683581572]
The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders.
We propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts.
We introduce a novel pre-training framework that learns importance-aware lexicon representations.
Our framework achieves 5.5 to 221.3X faster retrieval speed and 13.2 to 48.8X less index storage memory.
arXiv Detail & Related papers (2023-02-06T16:24:41Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval consists in finding images related to a given query text or vice-versa.
Many recent methods proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to fill in the gap between effectiveness and efficiency.
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
- Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z)
- ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of each of the two elements of the query into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z)
- Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features [10.163477961551592]
Cross-modal retrieval is an important functionality in modern search engines.
In this paper, we focus on the image-sentence retrieval task.
We use the recently introduced TERN architecture as an image-sentence features extractor.
arXiv Detail & Related papers (2021-06-01T10:11:46Z)
- Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.