Sentence-level Prompts Benefit Composed Image Retrieval
- URL: http://arxiv.org/abs/2310.05473v1
- Date: Mon, 9 Oct 2023 07:31:44 GMT
- Title: Sentence-level Prompts Benefit Composed Image Retrieval
- Authors: Yang Bai, Xinxing Xu, Yong Liu, Salman Khan, Fahad Khan, Wangmeng Zuo,
Rick Siow Mong Goh, Chun-Mei Feng
- Abstract summary: Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
- Score: 69.78119883060006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Composed image retrieval (CIR) is the task of retrieving specific images by
using a query that involves both a reference image and a relative caption. Most
existing CIR models adopt the late-fusion strategy to combine visual and
language features. Besides, several approaches have also been suggested to
generate a pseudo-word token from the reference image, which is further
integrated into the relative caption for CIR. However, these pseudo-word-based
prompting methods have limitations when target image encompasses complex
changes on reference image, e.g., object removal and attribute modification. In
this work, we demonstrate that learning an appropriate sentence-level prompt
for the relative caption (SPRC) is sufficient for achieving effective composed
image retrieval. Instead of relying on pseudo-word-based prompts, we propose to
leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level
prompts. By concatenating the learned sentence-level prompt with the relative
caption, one can readily use existing text-based image retrieval models to
enhance CIR performance. Furthermore, we introduce both image-text contrastive
loss and text prompt alignment loss to enforce the learning of suitable
sentence-level prompts. Experiments show that our proposed method performs
favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR
datasets. The source code and pretrained model are publicly available at
https://github.com/chunmeifeng/SPRC
Related papers
- Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for ZS-CIR.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
arXiv Detail & Related papers (2024-09-07T21:52:58Z) - Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
arXiv Detail & Related papers (2024-05-01T15:19:54Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with the advanced large vision-language (VL) model in CIR task, however, they generally suffer from two main issues: lack of labeled triplets for model training and difficulty of deployment on resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z) - Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs)
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z) - Zero-Shot Composed Image Retrieval with Textual Inversion [28.513594970580396]
Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption.
We propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset.
arXiv Detail & Related papers (2023-03-27T14:31:25Z) - Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image
Retrieval [84.11127588805138]
Composed Image Retrieval (CIR) combines a query image with text to describe their intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training.
arXiv Detail & Related papers (2023-02-06T19:40:04Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel scheme based on prompt to train the UIC model, making best use of the powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.