Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image
Retrieval
- URL: http://arxiv.org/abs/2302.03084v2
- Date: Mon, 15 May 2023 18:43:27 GMT
- Title: Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image
Retrieval
- Authors: Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee,
Kate Saenko, Tomas Pfister
- Abstract summary: Composed Image Retrieval (CIR) combines a query image with text describing the intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training.
- Score: 84.11127588805138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In Composed Image Retrieval (CIR), a user combines a query image with text to
describe their intended target. Existing methods rely on supervised learning of
CIR models using labeled triplets consisting of the query image, text
specification, and the target image. Labeling such triplets is expensive and
hinders broad applicability of CIR. In this work, we propose to study an
important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to
build a CIR model without requiring labeled triplets for training. To this end,
we propose a novel method, called Pic2Word, that requires only weakly labeled
image-caption pairs and unlabeled image datasets to train. Unlike existing
supervised CIR models, our model trained on weakly labeled or unlabeled
datasets shows strong generalization across diverse ZS-CIR tasks, e.g.,
attribute editing, object composition, and domain conversion. Our approach
outperforms several supervised CIR methods on the common CIR benchmarks CIRR
and Fashion-IQ. Code will be made publicly available at
https://github.com/google-research/composed_image_retrieval.
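The abstract describes Pic2Word's core idea: a learned mapping network turns a frozen image embedding into a pseudo word token that can be composed with the query text, so retrieval reduces to cosine similarity in a shared embedding space. The following is a minimal NumPy sketch of that inference-time composition. All names are illustrative; random projections stand in for CLIP's frozen encoders, and additive averaging stands in for the paper's actual prompt-based composition ("a photo of [S] that ..." re-encoded by the text tower).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512  # shared embedding dimension (CLIP-like); illustrative value

# Stand-ins for frozen encoders. In Pic2Word these would be CLIP's
# vision and text towers; here they are fixed random projections.
W_img = rng.standard_normal((D, D)) / np.sqrt(D)
W_map = rng.standard_normal((D, D)) / np.sqrt(D)  # the learned mapping network

def normalize(v):
    return v / np.linalg.norm(v)

def encode_image(x):
    """Frozen image-encoder stand-in: raw feature -> joint embedding."""
    return normalize(W_img @ x)

def image_to_pseudo_word(img_emb):
    """Mapping network: image embedding -> pseudo word-token embedding."""
    return W_map @ img_emb

def encode_composed_query(pseudo_token, text_emb):
    """Compose the pseudo token with the text. A real system inserts the
    token into a prompt and re-encodes with the text tower; averaging is
    a crude stand-in for that composition."""
    return normalize(pseudo_token + text_emb)

# Toy retrieval: one query image + text modifier against a small gallery.
query_feat = rng.standard_normal(D)
text_emb = normalize(rng.standard_normal(D))
gallery = [encode_image(rng.standard_normal(D)) for _ in range(5)]

q = encode_composed_query(image_to_pseudo_word(encode_image(query_feat)), text_emb)
scores = [float(q @ g) for g in gallery]  # cosine similarities (unit vectors)
best = int(np.argmax(scores))  # index of the retrieved gallery image
```

Because both encoders are frozen and only the mapping network is trained (on image-caption pairs, no triplets), the same composed-query mechanism transfers zero-shot across attribute editing, object composition, and domain conversion.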
Related papers
- HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels [5.34016463729574]
Composed Image Retrieval (CIR) aims to retrieve images based on a query image with text.
Current Zero-Shot CIR (ZS-CIR) methods try to solve CIR tasks without using expensive triplet-labeled training datasets.
We propose Hybrid CIR (HyCIR), which uses synthetic labels to boost the performance of ZS-CIR.
arXiv Detail & Related papers (2024-07-08T09:55:36Z)
- iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval [26.101116761577796]
Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption.
We introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset.
We present an open-domain benchmarking dataset named CIRCO, where each query is labeled with multiple ground truths and a semantic categorization.
arXiv Detail & Related papers (2024-05-05T14:39:06Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Language-only Efficient Training of Zero-shot Composed Image Retrieval [46.93446891158521]
The composed image retrieval (CIR) task takes a composed query of an image and text, aiming to retrieve images that satisfy both conditions.
We propose a novel CIR framework, only using language for its training.
Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP)
arXiv Detail & Related papers (2023-12-04T16:22:06Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs)
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- CoVR-2: Automatic Data Construction for Composed Video Retrieval [59.854331104466254]
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together.
We propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs.
We also expand the scope of the task to include composed video retrieval (CoVR)
arXiv Detail & Related papers (2023-08-28T17:55:33Z)
- Zero-Shot Composed Image Retrieval with Textual Inversion [28.513594970580396]
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption.
We propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset.
arXiv Detail & Related papers (2023-03-27T14:31:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.