Vision-by-Language for Training-Free Compositional Image Retrieval
- URL: http://arxiv.org/abs/2310.09291v2
- Date: Mon, 26 Feb 2024 18:59:49 GMT
- Title: Vision-by-Language for Training-Free Compositional Image Retrieval
- Authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
- Abstract summary: Compositional Image Retrieval (CIR) aims to retrieve a target image from a database given a reference image and a textual modification.
Recent research sidesteps the need for costly triplet annotations by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
- Score: 78.60509831598745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an image and a target modification (e.g., an image of the Eiffel
tower and the text "without people and at night-time"), Compositional Image
Retrieval (CIR) aims to retrieve the relevant target image in a database. While
supervised approaches rely on costly annotated triplets (i.e., query image,
textual modification, and target image), recent research sidesteps this need by
using large-scale vision-language models (VLMs), performing Zero-Shot CIR
(ZS-CIR). However, state-of-the-art approaches in ZS-CIR still require training
task-specific, customized models over large amounts of image-text pairs. In this
work, we propose to tackle CIR in a training-free manner via our Compositional
Image Retrieval through Vision-by-Language (CIReVL), a simple yet
human-understandable and scalable pipeline that effectively recombines
large-scale VLMs with large language models (LLMs). By captioning the reference
image using a pre-trained generative VLM and asking an LLM to recompose the
caption based on the textual target modification for subsequent retrieval via,
e.g., CLIP, we achieve modular language reasoning. On four ZS-CIR benchmarks, we
find competitive and, in part, state-of-the-art performance, improving over
supervised methods. Moreover, the modularity of CIReVL offers simple
scalability without re-training, allowing us both to investigate scaling laws
and bottlenecks for ZS-CIR and to scale up to, in parts, more than double
previously reported results. Finally, we show that CIReVL makes CIR
human-understandable by composing image and text in a modular fashion in the
language domain, making it intervenable and allowing failure cases to be
re-aligned post hoc. Code will be released upon acceptance.
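The three-stage flow the abstract describes (caption, recompose, retrieve) is concrete enough to sketch. Below is a minimal Python illustration of that flow; the specific checkpoints (BLIP for captioning, CLIP ViT-B/32 for retrieval) and the `call_llm` helper are assumptions chosen for illustration, not the authors' exact configuration.

```python
# A minimal sketch of the three-stage CIReVL flow described above.
# Assumptions: BLIP as the captioning VLM, CLIP ViT-B/32 as the retrieval
# model, and a hypothetical `call_llm` hook standing in for whichever
# instruction-following LLM recomposes the caption.
import torch
from PIL import Image
from transformers import (
    BlipForConditionalGeneration,
    BlipProcessor,
    CLIPModel,
    CLIPProcessor,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def caption_image(image: Image.Image) -> str:
    """Stage 1: describe the reference image with a generative VLM."""
    inputs = blip_proc(images=image, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=40)
    return blip_proc.decode(out[0], skip_special_tokens=True)

def call_llm(prompt: str) -> str:
    """Hypothetical hook: wire up any instruction-following LLM client here."""
    raise NotImplementedError("plug in your preferred LLM client")

def recompose_caption(caption: str, modification: str) -> str:
    """Stage 2: have an LLM rewrite the caption under the modification text."""
    prompt = (
        f"Image description: {caption}\n"
        f"Requested change: {modification}\n"
        "Rewrite the description so it reflects the change. "
        "Answer with a single sentence."
    )
    return call_llm(prompt)

clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

def retrieve(query_text: str, gallery: list[Image.Image], top_k: int = 5) -> list[int]:
    """Stage 3: ordinary text-to-image retrieval with CLIP over the database."""
    inputs = clip_proc(
        text=[query_text], images=gallery, return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        # logits_per_text scores the query against every gallery image
        scores = clip(**inputs).logits_per_text.squeeze(0)
    return scores.topk(min(top_k, len(gallery))).indices.tolist()

# Usage:
# query = recompose_caption(caption_image(ref_img), "at night, without people")
# top_indices = retrieve(query, gallery)
```

Because the composed query exists as plain text after stage 2, the pipeline stays inspectable: as the abstract notes, a failure case can be re-aligned post hoc by editing the recomposed caption before running retrieval.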
Related papers
- Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for ZS-CIR.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
arXiv Detail & Related papers (2024-09-07T21:52:58Z)
- Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning of CIR models using labeled triplets of reference image, text, and target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z)
- Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and text describing the user's intent.
Existing methods have made great progress with advanced large vision-language (VL) models on the CIR task; however, they generally suffer from two main issues: a lack of labeled triplets for model training and difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z)
- Pretrain like Your Inference: Masked Tuning Improves Zero-Shot Composed Image Retrieval [17.70430913227593]
We introduce a novel unlabeled and pre-trained masked tuning approach to reduce the gap between the pre-trained model and the downstream CIR task.
With such a simple design, it can learn to capture fine-grained text-guided modifications.
arXiv Detail & Related papers (2023-11-13T02:49:57Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval [84.11127588805138]
Composed Image Retrieval (CIR) combines a query image with text to describe their intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training.
arXiv Detail & Related papers (2023-02-06T19:40:04Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)