Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
- URL: http://arxiv.org/abs/2405.00571v1
- Date: Wed, 1 May 2024 15:19:54 GMT
- Title: Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval
- Authors: Young Kyun Jang, Dat Huynh, Ashish Shah, Wen-Kai Chen, Ser-Nam Lim,
- Abstract summary: Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
- Score: 43.47770490199544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually-annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which utilize a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text-side. In order to resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, the TAT method is not only efficient in terms of the scale of the training dataset and training time, but it also serves as an excellent initial checkpoint for training supervised CIR models, thereby highlighting its wider potential. The integration of the Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks.
Related papers
- PDV: Prompt Directional Vectors for Zero-shot Composed Image Retrieval [37.95145173167645]
We introduce Prompt Directional Vector (PDV), a simple yet effective training-free enhancement that captures semantic modifications induced by user prompts.
PDV enables three key improvements: (1) dynamic composed text embeddings where prompt adjustments are controllable via a scaling factor, (2) composed image embeddings through semantic transfer from text prompts to image features, and (3) weighted fusion of composed text and image embeddings.
arXiv Detail & Related papers (2025-02-11T03:20:21Z) - Compositional Image Retrieval via Instruction-Aware Contrastive Learning [40.54022628032561]
Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference.
In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable.
We propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate composed representation.
arXiv Detail & Related papers (2024-12-07T22:46:52Z) - Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for ZS-CIR.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
arXiv Detail & Related papers (2024-09-07T21:52:58Z) - iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval [26.101116761577796]
Composed Image Retrieval (CIR) aims to retrieve target images visually similar to the reference one while incorporating the changes specified in the relative caption.
We introduce a new task, Zero-Shot CIR (ZS-CIR), that addresses CIR without the need for a labeled training dataset.
We present an open-domain benchmarking dataset named CIRCO, where each query is labeled with multiple ground truths and a semantic categorization.
arXiv Detail & Related papers (2024-05-05T14:39:06Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with the advanced large vision-language (VL) model in CIR task, however, they generally suffer from two main issues: lack of labeled triplets for model training and difficulty of deployment on resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z) - Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs)
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z) - Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z) - Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image
Retrieval [84.11127588805138]
Composed Image Retrieval (CIR) combines a query image with text to describe their intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training.
arXiv Detail & Related papers (2023-02-06T19:40:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.