Progressive Learning for Image Retrieval with Hybrid-Modality Queries
- URL: http://arxiv.org/abs/2204.11212v1
- Date: Sun, 24 Apr 2022 08:10:06 GMT
- Title: Progressive Learning for Image Retrieval with Hybrid-Modality Queries
- Authors: Yida Zhao, Yuqing Song, Qin Jin
- Abstract summary: Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), expresses the search intention through both the vision and text modalities.
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
- Score: 48.79599320198615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image retrieval with hybrid-modality queries, also known as composing text
and image for image retrieval (CTI-IR), is a retrieval task where the search
intention is expressed in a more complex query format, involving both vision
and text modalities. For example, a target product image is searched using a
reference product image along with text about changing certain attributes of
the reference image as the query. It is a more challenging image retrieval task
that requires both semantic space learning and cross-modal fusion. Previous
approaches that attempt to deal with both aspects achieve unsatisfactory
performance. In this paper, we decompose the CTI-IR task into a three-stage
learning problem to progressively learn the complex knowledge for image
retrieval with hybrid-modality queries. We first leverage the semantic
embedding space for open-domain image-text retrieval, and then transfer the
learned knowledge to the fashion-domain with fashion-related pre-training
tasks. Finally, we enhance the pre-trained model from single-query to
hybrid-modality query for the CTI-IR task. Furthermore, as the contribution of
individual modality in the hybrid-modality query varies for different retrieval
scenarios, we propose a self-supervised adaptive weighting strategy to
dynamically determine the importance of image and text in the hybrid-modality
query for better retrieval. Extensive experiments show that our proposed model
significantly outperforms state-of-the-art methods in the mean of Recall@K by
24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
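To make the adaptive weighting idea concrete, below is a minimal PyTorch sketch of fusing a hybrid-modality query, assuming precomputed image and text query embeddings; the two-layer weighting network, the weighted-sum fusion, and the cosine-similarity ranking are illustrative assumptions, not the paper's exact architecture or its self-supervised training objective.

```python
# Minimal sketch: adaptively weighted fusion of a hybrid-modality query.
# The weighting network and fusion rule are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveWeightedFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Predicts two scalars (image weight, text weight) from the
        # concatenated query embeddings.
        self.weight_net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2)
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weight_net(torch.cat([img_emb, txt_emb], dim=-1)), dim=-1)
        fused = w[:, :1] * img_emb + w[:, 1:] * txt_emb
        return F.normalize(fused, dim=-1)

# Usage: rank gallery images by cosine similarity to the fused query.
fusion = AdaptiveWeightedFusion(dim=512)
img_q, txt_q = torch.randn(4, 512), torch.randn(4, 512)  # query embeddings
gallery = F.normalize(torch.randn(1000, 512), dim=-1)    # candidate image embeddings
scores = fusion(img_q, txt_q) @ gallery.T                # (4, 1000) similarity matrix
ranked = scores.argsort(dim=-1, descending=True)
```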
Related papers
- Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for zero-shot CIR (ZS-CIR).
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
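As a rough illustration of weighted modality fusion in a training-free setting, the sketch below combines precomputed CLIP-style embeddings with a fixed weight alpha; the weight value and the plain nearest-neighbour search are assumptions for illustration, not the configuration reported in that paper.

```python
# Minimal sketch: training-free weighted modality fusion for zero-shot CIR,
# assuming precomputed CLIP-style embeddings. alpha is an illustrative value.
import torch
import torch.nn.functional as F

def fuse_query(img_emb: torch.Tensor, txt_emb: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Weighted sum of the reference-image and modification-text embeddings."""
    fused = alpha * F.normalize(img_emb, dim=-1) + (1 - alpha) * F.normalize(txt_emb, dim=-1)
    return F.normalize(fused, dim=-1)

# Retrieval is plain nearest-neighbour search over gallery embeddings.
query = fuse_query(torch.randn(1, 512), torch.randn(1, 512), alpha=0.6)
gallery = F.normalize(torch.randn(5000, 512), dim=-1)
top10 = (query @ gallery.T).topk(10, dim=-1).indices
```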
arXiv Detail & Related papers (2024-09-07T21:52:58Z) - Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z) - Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
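A minimal sketch of such an interactive loop is shown below; `caption_image`, `rewrite_query`, and `search` are hypothetical placeholders standing in for the VLM captioner, the LLM query rewriter, and the underlying text-to-image retriever, not that system's actual components.

```python
# Minimal sketch: interactive image retrieval with query rewriting.
# All three functions are placeholders for a VLM, an LLM, and a retriever.
from typing import List

def caption_image(image_path: str) -> str:
    # Placeholder for a VLM captioner describing the reference image.
    return "a red knee-length dress with short sleeves"

def rewrite_query(caption: str, feedback: str) -> str:
    # Placeholder: an LLM would merge the running query with the user's
    # relevance feedback into a sharper text query.
    return f"{caption}, but {feedback}"

def search(query: str, top_k: int = 5) -> List[str]:
    # Placeholder text-to-image retriever.
    return [f"result_{i}.jpg" for i in range(top_k)]

query = caption_image("reference.jpg")
for feedback in ["longer sleeves", "in navy blue"]:  # simulated user turns
    query = rewrite_query(query, feedback)
    results = search(query)
```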
arXiv Detail & Related papers (2024-04-29T14:46:35Z) - End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z) - EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of each of the two elements of the query into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
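Viewed through those two frameworks, a candidate image can be scored by a text-to-image term and an image-to-image term. The sketch below loosely illustrates that decomposition; the sigmoid gating by the text embedding is an assumption for illustration and not ARTEMIS's actual attention mechanism.

```python
# Illustrative sketch: score a candidate with an explicit text-to-image term
# plus a text-gated image-to-image term (a loose reading of the decomposition).
import torch
import torch.nn.functional as F

def score(ref_img: torch.Tensor, mod_txt: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    ref_img, mod_txt, target = (F.normalize(x, dim=-1) for x in (ref_img, mod_txt, target))
    explicit = (mod_txt * target).sum(-1)         # text-to-image matching term
    gate = torch.sigmoid(mod_txt)                 # text selects which dimensions matter
    implicit = (gate * ref_img * target).sum(-1)  # gated image-to-image similarity
    return explicit + implicit

scores = score(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512))  # (8,) per-candidate scores
```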
arXiv Detail & Related papers (2022-03-15T17:29:20Z) - Cross-Modal Retrieval Augmentation for Multi-Modal Classification [61.5253261560224]
We explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering.
First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement on image-caption retrieval.
Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines.
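A rough sketch of the retrieval-augmentation step is given below, assuming caption embeddings precomputed with an image-caption alignment model; the toy corpus, separator token, and downstream multimodal transformer are placeholders rather than that paper's trained components.

```python
# Minimal sketch: augment a VQA input with captions retrieved from an
# external corpus via a shared image-caption embedding space (placeholders).
import torch
import torch.nn.functional as F

# Tiny stand-in caption corpus with precomputed alignment-model embeddings.
corpus_captions = ["a dog catching a frisbee", "a bowl of ramen", "a city skyline at night"]
corpus_embs = F.normalize(torch.randn(len(corpus_captions), 512), dim=-1)

def retrieve_captions(img_emb: torch.Tensor, k: int = 2) -> list:
    # Nearest captions to the image in the shared image-caption space.
    idx = (F.normalize(img_emb, dim=-1) @ corpus_embs.T).topk(k, dim=-1).indices[0]
    return [corpus_captions[int(i)] for i in idx]

question = "What is the animal doing?"
img_emb = torch.randn(1, 512)  # image embedding from the alignment model
augmented_text = question + " [SEP] " + " [SEP] ".join(retrieve_captions(img_emb))
# `augmented_text` plus the image features would then feed the multimodal transformer.
```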
arXiv Detail & Related papers (2021-04-16T13:27:45Z)