Zero-Shot Composed Image Retrieval with Textual Inversion
- URL: http://arxiv.org/abs/2303.15247v2
- Date: Sat, 19 Aug 2023 14:04:41 GMT
- Title: Zero-Shot Composed Image Retrieval with Textual Inversion
- Authors: Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Alberto Del Bimbo
- Abstract summary: Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption.
We propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset.
- Score: 28.513594970580396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a
query composed of a reference image and a relative caption that describes the
difference between the two images. The high effort and cost required for
labeling datasets for CIR hamper the widespread usage of existing methods, as
they rely on supervised learning. In this work, we propose a new task,
Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled
training dataset. Our approach, named zero-Shot composEd imAge Retrieval with
textuaL invErsion (SEARLE), maps the visual features of the reference image
into a pseudo-word token in CLIP token embedding space and integrates it with
the relative caption. To support research on ZS-CIR, we introduce an
open-domain benchmarking dataset named Composed Image Retrieval on Common
Objects in context (CIRCO), which is the first dataset for CIR containing
multiple ground truths for each query. The experiments show that SEARLE
exhibits better performance than the baselines on the two main datasets for CIR
tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and
the model are publicly available at https://github.com/miccunifi/SEARLE.
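The core textual-inversion idea described in the abstract — projecting the reference image's visual features into a pseudo-word token in the text token-embedding space, then composing it with the relative caption — can be sketched as follows. This is a toy illustration with made-up dimensions and a random linear projection; the actual method uses CLIP encoders and a learned projection network, so all names and sizes here are assumptions.

```python
import numpy as np

# Toy sizes standing in for CLIP's visual-feature and token-embedding dims.
D_VIS, D_TOK = 8, 6

rng = np.random.default_rng(0)
# Stand-in for the learned projection phi (random here, trained in practice).
W = rng.standard_normal((D_TOK, D_VIS)) * 0.1

def textual_inversion(image_feat, caption_token_embs):
    """Map the reference image to a pseudo-word token and prepend it to
    the relative caption's token embeddings, forming a composed query."""
    pseudo_token = W @ image_feat                      # pseudo-word v* in token space
    return np.vstack([pseudo_token, caption_token_embs])

image_feat = rng.standard_normal(D_VIS)                # toy "CLIP image feature"
caption = rng.standard_normal((3, D_TOK))              # toy 3-token relative caption
seq = textual_inversion(image_feat, caption)
print(seq.shape)  # (4, 6): pseudo-token followed by the caption tokens
```

The composed sequence would then be run through the (frozen) text encoder to produce the retrieval query; that step is omitted here since it requires the real CLIP model.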
Related papers
- Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for ZS-CIR.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
arXiv Detail & Related papers (2024-09-07T21:52:58Z)
- HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels [5.34016463729574]

Composed Image Retrieval (CIR) aims to retrieve images based on a query image with text.
Current Zero-Shot CIR (ZS-CIR) methods try to solve CIR tasks without using expensive triplet-labeled training datasets.
We propose Hybrid CIR (HyCIR), which uses synthetic labels to boost the performance of ZS-CIR.
arXiv Detail & Related papers (2024-07-08T09:55:36Z)
- iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval [26.101116761577796]
Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption.
We introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset.
We present an open-domain benchmarking dataset named CIRCO, where each query is labeled with multiple ground truths and a semantic categorization.
arXiv Detail & Related papers (2024-05-05T14:39:06Z)
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
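Spherical linear interpolation (Slerp) between two representations, as referenced above, follows the standard formula: for unit vectors u and v separated by angle θ, slerp(u, v, t) = (sin((1−t)θ)·u + sin(tθ)·v) / sin θ. A minimal sketch (this is the generic Slerp formula on toy vectors, not the paper's actual pipeline, which applies it to CLIP image and text embeddings):

```python
import numpy as np

def slerp(u, v, t):
    """Spherical linear interpolation between vectors u and v at weight t."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    theta = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if np.isclose(theta, 0.0):
        return u  # vectors are (nearly) parallel; interpolation is trivial
    return (np.sin((1 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

# Toy stand-ins for an image embedding and a text embedding.
img = np.array([1.0, 0.0])
txt = np.array([0.0, 1.0])
q = slerp(img, txt, 0.5)  # balanced merge of the two modalities
print(np.round(q, 4))  # [0.7071 0.7071]
```

Unlike plain linear interpolation, Slerp keeps the merged query on the unit hypersphere where CLIP embeddings live, which is the motivation for using it to fuse the two modalities.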
arXiv Detail & Related papers (2024-05-01T15:19:54Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning of CIR models using labeled triplets of reference image, text, and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval [53.89454443114146]
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description without training on the triplet datasets.
Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space.
We propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs).
KEDs implicitly models the attributes of the reference images by incorporating a database.
arXiv Detail & Related papers (2024-03-24T04:23:56Z)
- Language-only Efficient Training of Zero-shot Composed Image Retrieval [46.93446891158521]
The composed image retrieval (CIR) task takes a composed query of image and text, aiming to retrieve images that satisfy both conditions.
We propose a novel CIR framework, only using language for its training.
Our LinCIR (Language-only training for CIR) can be trained only with text datasets via a novel self-supervision named self-masking projection (SMP).
arXiv Detail & Related papers (2023-12-04T16:22:06Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs)
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval [84.11127588805138]
Composed Image Retrieval (CIR) combines a query image with text to describe their intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training.
arXiv Detail & Related papers (2023-02-06T19:40:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.