Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval
- URL: http://arxiv.org/abs/2403.01431v1
- Date: Sun, 3 Mar 2024 07:58:03 GMT
- Title: Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval
- Authors: Yongchao Du, Min Wang, Wengang Zhou, Shuping Hui, Houqiang Li
- Abstract summary: The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with advanced large vision-language (VL) models on the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and the difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and relies only on unlabeled images for composition learning.
- Score: 92.13664084464514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of composed image retrieval (CIR) aims to retrieve images based on
the query image and the text describing the users' intent. Existing methods
have made great progress with advanced large vision-language (VL) models on the
CIR task; however, they generally suffer from two main issues: a lack of labeled
triplets for model training and the difficulty of deploying the large VL model
in resource-restricted environments. To tackle these problems, we propose
Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which
takes advantage of the VL model and relies only on unlabeled images for
composition learning. In this framework, we propose a new adaptive token learner
that maps an image to a sentence in the word embedding space of the VL model.
The sentence adaptively captures discriminative visual information and is
further integrated with the text modifier. An asymmetric structure is devised
for flexible deployment, in which a lightweight model is adopted on the query
side while the large VL model is deployed on the gallery side. Global
contrastive distillation and local alignment regularization are adopted to
align the lightweight model with the VL model for the CIR task. Our experiments
demonstrate that the proposed ISA copes better with real retrieval scenarios
and further improves retrieval accuracy and efficiency.
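A minimal sketch of the asymmetric design described above, assuming a PyTorch setup; the module name AdaptiveTokenLearner, the attention-pooling mechanism, and all dimensions are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveTokenLearner(nn.Module):
    """Maps patch features from a lightweight query-side encoder to K
    pseudo-word tokens in the VL model's word-embedding space (the learned
    "sentence"); attention pooling here is an assumption, not the paper's
    exact mechanism."""
    def __init__(self, light_dim: int, word_dim: int, num_tokens: int = 4):
        super().__init__()
        self.attn = nn.Linear(light_dim, num_tokens)  # per-patch token scores
        self.proj = nn.Linear(light_dim, word_dim)    # into word-embedding space

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, light_dim); weights: (B, N, K) over patches per token
        weights = self.attn(patches).softmax(dim=1)
        # tokens: (B, K, word_dim); these would be concatenated with the text
        # modifier's word embeddings and fed to the VL text encoder
        return torch.einsum("bnk,bnd->bkd", weights, self.proj(patches))

def global_contrastive_distillation(student: torch.Tensor,
                                    teacher: torch.Tensor,
                                    tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling the light model's query embeddings toward
    the frozen large VL model's embeddings of the same unlabeled images."""
    student = F.normalize(student, dim=-1)
    teacher = F.normalize(teacher, dim=-1)
    logits = student @ teacher.t() / tau
    targets = torch.arange(student.size(0), device=student.device)
    return F.cross_entropy(logits, targets)
```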
Related papers
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
arXiv Detail & Related papers (2024-05-01T15:19:54Z)
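Slerp itself has a standard closed form; a small self-contained sketch, assuming unit-normalized CLIP-style image and text embeddings:

```python
import torch
import torch.nn.functional as F

def slerp(u: torch.Tensor, v: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Spherical linear interpolation between two embeddings.

    t = 0 returns u, t = 1 returns v; intermediate t moves along the
    great circle connecting the two unit vectors.
    """
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    # angle between the embeddings, clamped for numerical safety
    cos = (u * v).sum(dim=-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    omega = torch.acos(cos)
    return (torch.sin((1 - t) * omega) * u
            + torch.sin(t * omega) * v) / torch.sin(omega)
```

At t = 0.5 the merged query weights the image and text equally; the interpolation weight acts as a balance hyperparameter between the two modalities.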
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models, using labeled triplets of (reference image, text, target image).
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database.
Recent research sidesteps the need for costly labeled training triplets by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL), sketched below.
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
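The entry above describes a modular, training-free pipeline; a hedged outline under that reading, where captioner, llm, text_encoder, and the prompt wording are assumed stand-ins rather than the paper's exact components:

```python
import torch

def training_free_cir(image, modifier: str, captioner, llm, text_encoder,
                      gallery: torch.Tensor) -> torch.Tensor:
    """Caption the reference image, let an LLM recompose the caption with
    the text modifier, then retrieve purely in the text domain; every
    component is an off-the-shelf callable, none is trained for CIR."""
    caption = captioner(image)  # e.g., a BLIP-style captioning model
    prompt = (f"Description of an image: '{caption}'. "
              f"Rewrite it so that the following change applies: {modifier}.")
    target_caption = llm(prompt)            # recomposed target description
    query = text_encoder(target_caption)    # unit-normalized text embedding
    scores = gallery @ query                # cosine similarity to gallery
    return scores.argsort(descending=True)  # ranked gallery indices
```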
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR), which aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's ability to express search intent.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.