Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
- URL: http://arxiv.org/abs/2602.00813v2
- Date: Tue, 03 Feb 2026 14:05:53 GMT
- Title: Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
- Authors: Tong Wang, Yunhan Zhao, Shu Kong
- Abstract summary: Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query. The challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching.
- Score: 21.229497760570556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Specifically, we prompt an LMM to generate a "mental image" for a given multimodal query and propose to use this "mental image" to search for the target image. As the "mental image" has a synthetic-to-real domain gap with real images, we also generate a synthetic counterpart for each real image in the database to facilitate matching. In this sense, our method uses an LMM to construct a "paracosm", where it matches the multimodal query and database images. Hence, we call this method Paracosm. Notably, Paracosm is a training-free zero-shot CIR method. It significantly outperforms existing zero-shot methods on four challenging benchmarks, achieving state-of-the-art performance for zero-shot CIR.
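The matching pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `generate_mental_image`, `synthesize_counterpart`, and `embed` are hypothetical placeholders for an LMM image generator and an image encoder, and ranking by cosine similarity between the generated query image and the synthetic database counterparts is an assumed matching rule that the abstract does not specify.

```python
from typing import Any, Callable, Sequence

import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def paracosm_rank(
    reference_image: Any,
    modification_text: Any,
    database_images: Sequence[Any],
    generate_mental_image: Callable[[Any, Any], Any],  # hypothetical LMM call
    synthesize_counterpart: Callable[[Any], Any],      # hypothetical LMM call
    embed: Callable[[Any], np.ndarray],                # hypothetical image encoder
) -> np.ndarray:
    """Return database indices sorted from best to worst match."""
    # Step 1: turn the multimodal query into a generated "mental image".
    mental_image = generate_mental_image(reference_image, modification_text)
    query_vec = l2_normalize(np.asarray(embed(mental_image), dtype=float))

    # Step 2: map every real database image to a synthetic counterpart,
    # so that query and database live in the same synthetic domain.
    synthetic_db = [synthesize_counterpart(img) for img in database_images]
    db_vecs = l2_normalize(
        np.stack([np.asarray(embed(s), dtype=float) for s in synthetic_db])
    )

    # Step 3 (assumed): rank by cosine similarity in the synthetic domain.
    scores = db_vecs @ query_vec
    return np.argsort(-scores)


# Toy stand-ins: "images" are 2-D vectors and the "text" is an edit vector.
database = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
ranking = paracosm_rank(
    reference_image=np.array([1.0, 0.0]),
    modification_text=np.array([0.0, 1.0]),
    database_images=database,
    generate_mental_image=lambda img, edit: img + edit,
    synthesize_counterpart=lambda img: img,
    embed=lambda img: img,
)
print(ranking[0])  # index of the top-ranked database image
```

In a real system the synthetic counterparts and their embeddings would presumably be computed once offline, since they depend only on the database and not on the query.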
Related papers
- Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z) - Chain-of-Thought Re-ranking for Image Retrieval Tasks [16.13448876168839]
We propose a novel Chain-of-Thought Re-Ranking (CoTRR) method to address image retrieval. By allowing MLLM to perform listwise reasoning, our method supports global comparison, consistent reasoning, and interpretable decision-making. Our method achieves state-of-the-art performance across three image retrieval tasks, including text-to-image retrieval (TIR), composed image retrieval (CIR) and chat-based image retrieval (Chat-IR).
arXiv Detail & Related papers (2025-09-18T08:48:46Z) - Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval [52.709090256954276]
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query. We propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR.
arXiv Detail & Related papers (2025-05-26T13:17:50Z) - Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for ZS-CIR.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
arXiv Detail & Related papers (2024-09-07T21:52:58Z) - Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs [44.48400303207482]
The objective of zero-shot composed image retrieval (CIR) is to retrieve the target image using a query image and a query text.
Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text.
We propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs.
arXiv Detail & Related papers (2024-06-27T02:10:30Z) - Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress on the CIR task with advanced large vision-language (VL) models; however, they generally suffer from two main issues: a lack of labeled triplets for model training and difficulty of deployment in resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z) - Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
arXiv Detail & Related papers (2023-10-13T17:59:38Z) - Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z) - Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval [84.11127588805138]
Composed Image Retrieval (CIR) combines a query image with text to describe their intended target.
Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image.
We propose Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training.
arXiv Detail & Related papers (2023-02-06T19:40:04Z)