HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels
- URL: http://arxiv.org/abs/2407.05795v2
- Date: Tue, 9 Jul 2024 03:44:55 GMT
- Title: HyCIR: Boosting Zero-Shot Composed Image Retrieval with Synthetic Labels
- Authors: Yingying Jiang, Hanchao Jia, Xiaobing Wang, Peng Hao
- Abstract summary: Composed Image Retrieval (CIR) aims to retrieve images based on a query image with text.
Current Zero-Shot CIR (ZS-CIR) methods try to solve CIR tasks without using expensive triplet-labeled training datasets.
We propose Hybrid CIR (HyCIR), which uses synthetic labels to boost the performance of ZS-CIR.
- Score: 5.34016463729574
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Composed Image Retrieval (CIR) aims to retrieve images based on a query image with text. Current Zero-Shot CIR (ZS-CIR) methods try to solve CIR tasks without using expensive triplet-labeled training datasets. However, the gap between ZS-CIR and triplet-supervised CIR is still large. In this work, we propose Hybrid CIR (HyCIR), which uses synthetic labels to boost the performance of ZS-CIR. A new label synthesis pipeline for CIR (SynCir) is proposed, in which only unlabeled images are required. First, image pairs are extracted based on visual similarity. Second, query text is generated for each image pair based on a vision-language model and an LLM. Third, the data is further filtered in language space based on semantic similarity. To improve ZS-CIR performance, we propose a hybrid training strategy to work with both ZS-CIR supervision and synthetic CIR triplets. Two kinds of contrastive learning are adopted. One uses a large-scale unlabeled image dataset to learn an image-to-text mapping with good generalization. The other uses synthetic CIR triplets to learn a better mapping for CIR tasks. Our approach achieves SOTA zero-shot performance on the common CIR benchmarks: CIRR and CIRCO.
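The three SynCir steps in the abstract can be sketched as cosine-similarity filtering over precomputed embeddings. Everything below is illustrative: the similarity band, the threshold values, and the `text_emb` encoder are assumptions for the sketch, not the paper's actual models or settings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mine_pairs(image_embs, lo=0.5, hi=0.9):
    """Step 1: pair images whose visual similarity falls in a band --
    similar enough to be related, different enough that a query text
    describing the change is meaningful. Thresholds are illustrative."""
    pairs = []
    n = len(image_embs)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if lo <= cosine(image_embs[i], image_embs[j]) <= hi:
                pairs.append((i, j))
    return pairs

def filter_triplets(triplets, text_emb, min_sim=0.6):
    """Step 3: keep a (reference, query-text, target-caption) triplet only
    if the generated query text is semantically consistent with the target
    caption. text_emb maps a string to an embedding (e.g. a sentence
    encoder); step 2 (VLM + LLM text generation) is assumed done upstream."""
    kept = []
    for ref, query_text, target_caption in triplets:
        if cosine(text_emb(query_text), text_emb(target_caption)) >= min_sim:
            kept.append((ref, query_text, target_caption))
    return kept
```

The surviving triplets would then feed the second contrastive loss of the hybrid training strategy, alongside the large-scale unlabeled-image loss.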
Related papers
- iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval [26.101116761577796]
Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption.
We introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset.
We present an open-domain benchmarking dataset named CIRCO, where each query is labeled with multiple ground truths and a semantic categorization.
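Because each CIRCO query has multiple ground truths, retrieval quality is typically scored with mean average precision at k (mAP@k) rather than single-target recall. A minimal sketch of that metric follows; it assumes set-valued ground truths, and the benchmark's official evaluation script may differ in details.

```python
def average_precision_at_k(ranked_ids, positive_ids, k=10):
    """AP over the top-k retrieved ids for one query with multiple
    ground truths: precision is accumulated at each hit's rank."""
    positives = set(positive_ids)
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in positives:
            hits += 1
            score += hits / rank
    denom = min(len(positives), k)
    return score / denom if denom else 0.0

def map_at_k(all_rankings, all_positives, k=10):
    """Mean AP@k over all queries."""
    aps = [average_precision_at_k(r, p, k)
           for r, p in zip(all_rankings, all_positives)]
    return sum(aps) / len(aps)
```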
arXiv Detail & Related papers (2024-05-05T14:39:06Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Language-only Efficient Training of Zero-shot Composed Image Retrieval [46.93446891158521]
The composed image retrieval (CIR) task takes a composed query of an image and text, aiming to retrieve images that satisfy both conditions.
We propose a novel CIR framework, only using language for its training.
Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP).
arXiv Detail & Related papers (2023-12-04T16:22:06Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database.
Recent research sidesteps the need for task-specific training by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Zero-Shot Composed Image Retrieval with Textual Inversion [28.513594970580396]
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption.
We propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset.
arXiv Detail & Related papers (2023-03-27T14:31:25Z)
- DiffIR: Efficient Diffusion Model for Image Restoration [108.82579440308267]
The diffusion model (DM) has achieved SOTA performance by modeling the image synthesis process as a sequential application of a denoising network.
Traditional DMs, which run massive iterations of a large model to estimate whole images or feature maps, are inefficient for image restoration.
We propose DiffIR, which consists of a compact IR prior extraction network (CPEN), dynamic IR transformer (DIRformer), and denoising network.
arXiv Detail & Related papers (2023-03-16T16:47:14Z)
- Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval [84.11127588805138]
Composed Image Retrieval (CIR) combines a query image with text to describe their intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training.
arXiv Detail & Related papers (2023-02-06T19:40:04Z)
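The image-to-text mapping idea recurring in these entries (and in HyCIR's first contrastive objective), mapping a reference image to a pseudo-word token that a text encoder can compose with a caption, can be sketched in a toy form. The projection `phi`, the mean-pooling "text encoder", and all dimensions below are stand-ins for a trained mapping over a frozen vision-language model, not anyone's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real systems use a frozen CLIP-style model here.
IMG_DIM, TOK_DIM = 8, 4

# Hypothetical learned projection phi: image embedding -> pseudo-word token.
# In practice this mapping is trained contrastively; here it is a fixed
# random matrix just to show the data flow.
phi = rng.normal(size=(IMG_DIM, TOK_DIM))

def text_encoder(token_embs):
    """Stand-in text encoder: mean-pool token embeddings and L2-normalize."""
    v = np.mean(token_embs, axis=0)
    return v / np.linalg.norm(v)

def compose_query(image_emb, caption_tokens):
    """Map the reference image to a pseudo-word token, splice it into the
    caption's token sequence, and encode the result as the query vector."""
    pseudo_token = image_emb @ phi
    return text_encoder(np.vstack([pseudo_token[None, :], caption_tokens]))

def retrieve(query_vec, gallery):
    """Return gallery indices sorted by cosine similarity to the query."""
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ query_vec))
```

The appeal of this design is that no labeled triplets are needed: only the image-to-token projection is learned, while the text encoder and the retrieval index stay frozen.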
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.