Compositional Learning of Image-Text Query for Image Retrieval
- URL: http://arxiv.org/abs/2006.11149v3
- Date: Mon, 31 May 2021 21:35:55 GMT
- Title: Compositional Learning of Image-Text Query for Image Retrieval
- Authors: Muhammad Umer Anwaar, Egor Labintcev, Martin Kleinsteuber
- Abstract summary: We propose an autoencoder-based model, ComposeAE, to learn the composition of image and text queries for retrieving images.
We adopt a deep metric learning approach and learn a metric that pushes the composition of the source image and text query closer to the target images.
- Score: 3.9348884623092517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate the problem of retrieving images from a
database based on a multi-modal (image-text) query. Specifically, the query
text prompts some modification in the query image and the task is to retrieve
images with the desired modifications. For instance, a user of an E-Commerce
platform is interested in buying a dress, which should look similar to her
friend's dress, but the dress should be white with a ribbon sash. In
this case, we would like the algorithm to retrieve some dresses with desired
modifications in the query dress. We propose an autoencoder-based model,
ComposeAE, to learn the composition of image and text queries for retrieving
images. We adopt a deep metric learning approach and learn a metric that pushes
the composition of the source image and text query closer to the target images. We also
propose a rotational symmetry constraint on the optimization problem. Our
approach is able to outperform the state-of-the-art method TIRG \cite{TIRG} on
three benchmark datasets, namely: MIT-States, Fashion200k and Fashion IQ. In
order to ensure fair comparison, we introduce strong baselines by enhancing
TIRG method. To ensure reproducibility of the results, we publish our code
here: \url{https://github.com/ecom-research/ComposeAE}.
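To make the approach concrete, here is a minimal sketch of a ComposeAE-style training step under loose assumptions: image and text features are fused by a small composition network, a batch-based classification loss pulls each composed query toward its own target, and a rotational-symmetry penalty asks that rotating the target back by a text-conditioned angle recovers the source. The feature dimensions, the complex-valued rotation parameterization, the temperature, and the 0.1 loss weight are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Composer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # maps concatenated (image, text) features to a composed query embedding
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # text-conditioned rotation angles for a complex-valued view of the embedding (assumption)
        self.angle = nn.Linear(dim, dim // 2)

    def rotate(self, z, theta):
        # interpret the embedding as dim/2 complex numbers and rotate each by theta
        a, b = z.chunk(2, dim=-1)
        return torch.cat([a * torch.cos(theta) - b * torch.sin(theta),
                          a * torch.sin(theta) + b * torch.cos(theta)], dim=-1)

    def forward(self, z_img, z_txt):
        theta = self.angle(z_txt)
        composed = self.fuse(torch.cat([self.rotate(z_img, theta), z_txt], dim=-1))
        return F.normalize(composed, dim=-1), theta

def training_loss(model, z_img, z_txt, z_tgt):
    q, theta = model(z_img, z_txt)
    t = F.normalize(z_tgt, dim=-1)
    # batch-based classification: each composed query should match its own target
    logits = q @ t.t() / 0.07
    retrieval = F.cross_entropy(logits, torch.arange(q.size(0)))
    # rotational symmetry: rotating the target back by -theta should land near the source
    symmetry = F.mse_loss(model.rotate(z_tgt, -theta), z_img)
    return retrieval + 0.1 * symmetry

# toy usage with random features standing in for image/text encoder outputs
model = Composer()
z_img, z_txt, z_tgt = (torch.randn(8, 512) for _ in range(3))
print(training_loss(model, z_img, z_txt, z_tgt))
```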
Related papers
- Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy [23.041812897803034]
Zero-shot Composed Image Retrieval (ZSCIR) requires retrieving images that match the query image and the relative captions.
We introduce Imagined Proxy for CIR (IP-CIR), a training-free method that creates a proxy image aligned with the query image and text description.
Our newly proposed balancing metric integrates text-based and proxy retrieval similarities, allowing for more accurate retrieval of the target image.
arXiv Detail & Related papers (2024-11-24T05:27:21Z)
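The balancing metric mentioned in the Imagine and Seek entry above amounts to fusing two similarity channels before ranking. A minimal sketch, assuming a simple convex combination; the paper's exact weighting may differ, and `alpha` and the toy scores are made up for illustration:

```python
import numpy as np

def balanced_scores(text_sim: np.ndarray, proxy_sim: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Combine two per-candidate similarity arrays into one ranking score."""
    return alpha * text_sim + (1.0 - alpha) * proxy_sim

# toy usage: three candidate images scored by both channels
text_sim = np.array([0.21, 0.67, 0.40])   # query text vs. candidate images
proxy_sim = np.array([0.55, 0.30, 0.62])  # imagined proxy image vs. candidates
ranking = np.argsort(-balanced_scores(text_sim, proxy_sim))
print(ranking)  # candidate indices, best match first
```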
- Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs [44.48400303207482]
The objective of zero-shot composed image retrieval (CIR) is to retrieve the target image using a query image and a query text.
Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text.
We propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs.
arXiv Detail & Related papers (2024-06-27T02:10:30Z)
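The textual-inversion approach that the entry above contrasts against can be pictured as a small mapping network that turns an image embedding into a "pseudo word" token spliced into a text prompt. A hedged sketch, where the dimensions, the prompt template, and the slot position are assumptions rather than any paper's design:

```python
import torch
import torch.nn as nn

class TextualInversion(nn.Module):
    def __init__(self, img_dim=768, tok_dim=512):
        super().__init__()
        # maps an image embedding to a token embedding the text encoder can consume
        self.to_token = nn.Sequential(nn.Linear(img_dim, tok_dim), nn.GELU(), nn.Linear(tok_dim, tok_dim))

    def forward(self, z_img, prompt_tokens):
        # prompt_tokens: (seq_len, tok_dim) embeddings of e.g. "a photo of * that ..."
        pseudo = self.to_token(z_img)   # pseudo-word embedding for the query image
        seq = prompt_tokens.clone()
        seq[3] = pseudo                 # assumed position of the "*" slot
        return seq                      # fed to the text encoder downstream

inv = TextualInversion()
print(inv(torch.randn(768), torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```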
- MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions [64.89284104414865]
We introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions.
MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations.
MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks.
arXiv Detail & Related papers (2024-03-28T17:59:20Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
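The sentence-level prompt idea from the entry above reduces to captioning the reference image and merging that caption with the relative caption into one textual query. An illustrative sketch in which `caption_image` is a hypothetical stand-in for a real pretrained captioner such as BLIP-2, and the merge template is an assumption:

```python
def caption_image(image) -> str:
    # placeholder: in practice this would call a pretrained V-L captioner (e.g. BLIP-2)
    return "a red sleeveless dress"

def sentence_level_prompt(image, relative_caption: str) -> str:
    # merge the generated caption with the user's relative caption into one sentence
    return f"{caption_image(image)}, but {relative_caption}"

print(sentence_level_prompt(None, "in white with a ribbon sash"))
# -> "a red sleeveless dress, but in white with a ribbon sash"
```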
- Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval [89.30660533051514]
Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image and vice versa.
Image-text retrieval models commonly learn spurious correlations in the training data, such as frequent object co-occurrence.
We introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.
arXiv Detail & Related papers (2023-04-06T21:45:46Z)
- Bi-directional Training for Composed Image Retrieval via Text Prompt Learning [46.60334745348141]
Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text.
We propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures.
Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model.
arXiv Detail & Related papers (2023-03-29T11:37:41Z)
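The bi-directional scheme in the entry above can be read as adding a second retrieval loss over reversed queries: the target image plus reversed modification text should retrieve the reference image. A minimal sketch, with `compose` standing in for any composition architecture and the reversed-text embedding assumed given (the paper obtains it via text prompt learning); the temperature is an assumption:

```python
import torch
import torch.nn.functional as F

def retrieval_loss(queries, gallery):
    # batch-based classification: each query should rank its own gallery item first
    logits = F.normalize(queries, dim=-1) @ F.normalize(gallery, dim=-1).t()
    return F.cross_entropy(logits / 0.07, torch.arange(queries.size(0)))

def bidirectional_loss(compose, z_ref, z_txt_fwd, z_txt_rev, z_tgt):
    forward = retrieval_loss(compose(z_ref, z_txt_fwd), z_tgt)   # ref + text -> target
    backward = retrieval_loss(compose(z_tgt, z_txt_rev), z_ref)  # target + reversed text -> ref
    return forward + backward

# toy usage with addition as a stand-in composition function
compose = lambda z_img, z_txt: z_img + z_txt
z = {k: torch.randn(8, 256) for k in ("ref", "fwd", "rev", "tgt")}
print(bidirectional_loss(compose, z["ref"], z["fwd"], z["rev"], z["tgt"]))
```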
- Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in mean Recall@K, by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z)
- Embedding Arithmetic for Text-driven Image Transformation [48.7704684871689]
Text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man.
Recent works aiming at bridging this semantic gap embed images and text into a multimodal space.
We introduce the SIMAT dataset to evaluate the task of text-driven image transformation.
arXiv Detail & Related papers (2021-12-06T16:51:50Z)
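The embedding-arithmetic idea in the entry above treats a text edit as a vector offset applied to an image embedding in a shared multimodal space, in the spirit of king - man + woman ≈ queen. A worked sketch where the encoders, the scaling factor `lam`, and the toy vectors are assumptions:

```python
import numpy as np

def apply_text_edit(z_img: np.ndarray, z_src_text: np.ndarray,
                    z_tgt_text: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Move the image embedding along the text direction src -> tgt."""
    z = z_img + lam * (z_tgt_text - z_src_text)
    return z / np.linalg.norm(z)

# toy usage: edit "a cat on the grass" toward "a dog on the grass"
rng = np.random.default_rng(0)
z_img, z_cat, z_dog = (rng.standard_normal(512) for _ in range(3))
query = apply_text_edit(z_img, z_cat, z_dog)
print(query.shape)  # the edited embedding is then used to retrieve the target image
```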
- RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
arXiv Detail & Related papers (2021-04-07T09:41:52Z)
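As a rough illustration of residual text-image composition in the spirit of the RTIC entry above, the sketch below stacks residual blocks that refine a fused feature; the GCN component the paper combines with such composition methods is left out of scope, and the block count and widths are assumptions:

```python
import torch
import torch.nn as nn

class ResidualComposition(nn.Module):
    def __init__(self, dim=512, n_blocks=2):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)  # fuse concatenated image+text features
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(n_blocks)
        )

    def forward(self, z_img, z_txt):
        h = self.proj(torch.cat([z_img, z_txt], dim=-1))
        for block in self.blocks:
            h = h + block(h)  # residual refinement of the composed feature
        return h

model = ResidualComposition()
print(model(torch.randn(4, 512), torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```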
This list is automatically generated from the titles and abstracts of the papers on this site.