Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking
- URL: http://arxiv.org/abs/2312.08924v2
- Date: Sun, 24 Mar 2024 14:23:59 GMT
- Title: Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking
- Authors: Shitong Sun, Fanghua Ye, Shaogang Gong,
- Abstract summary: Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text.
Most current composed image retrieval methods follow a supervised learning approach to training on a costly triplet dataset composed of a reference image, modified text, and a corresponding target image.
We present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text.
- Score: 34.31345844296072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text. It has recently attracted attention due to the collaboration of information-rich images and concise language to precisely express the requirements of target images. Most current composed image retrieval methods follow a supervised learning approach to training on a costly triplet dataset composed of a reference image, modified text, and a corresponding target image. To avoid difficult to-obtain labeled triplet training data, zero-shot composed image retrieval (ZS-CIR) has been introduced, which aims to retrieve the target image by learning from image-text pairs (self-supervised triplets), without the need for human-labeled triplets. However, this self-supervised triplet learning approach is computationally less effective and less understandable as it assumes the interaction between image and text is conducted with implicit query embedding without explicit semantical interpretation. In this work, we present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text. This helps improve model learning efficiency to enhance the generalization capacity of foundation models. Further, we introduce a Local Concept Re-ranking (LCR) mechanism to focus on discriminative local information extracted from the modified instructions. Extensive experiments on four ZS-CIR benchmarks show that our method achieves comparable performances to that of the state of-the-art triplet training based methods, but significantly outperforms other training-free methods on the open domain datasets (CIRR, CIRCO and COCO), as well as the fashion domain dataset (FashionIQ).
Related papers
- SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval [7.248145893361865]
Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image.
Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR.
In this work, we propose SCOT, a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network.
arXiv Detail & Related papers (2025-01-12T07:23:49Z) - Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval [28.018754406453937]
Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image.
We present One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR)
OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks.
arXiv Detail & Related papers (2024-12-15T06:22:20Z) - Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for ZS-CIR.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
arXiv Detail & Related papers (2024-09-07T21:52:58Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs)
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL)
arXiv Detail & Related papers (2023-10-13T17:59:38Z) - Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z) - Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image
Retrieval [84.11127588805138]
Composed Image Retrieval (CIR) combines a query image with text to describe their intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training.
arXiv Detail & Related papers (2023-02-06T19:40:04Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel underlinetextbfBottom-up crunderlinetextbfOss-modal underlinetextbfSemantic compounderlinetextbfSition (textbfBOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.