Zero Shot Composed Image Retrieval
- URL: http://arxiv.org/abs/2506.06602v1
- Date: Sat, 07 Jun 2025 00:38:43 GMT
- Title: Zero Shot Composed Image Retrieval
- Authors: Santhosh Kakarla, Gautama Shastry Bulusu Venkata
- Abstract summary: Composed image retrieval (CIR) allows a user to locate a target image by applying a fine-grained textual edit. Zero-shot CIR, which embeds the image and the text with separate pretrained vision-language encoders, reaches only 20-25% Recall@10 on the FashionIQ benchmark. We improve this by fine-tuning BLIP-2 with a lightweight Q-Former that fuses visual and textual features into a single embedding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Composed image retrieval (CIR) allows a user to locate a target image by applying a fine-grained textual edit (e.g., "turn the dress blue" or "remove stripes") to a reference image. Zero-shot CIR, which embeds the image and the text with separate pretrained vision-language encoders, reaches only 20-25% Recall@10 on the FashionIQ benchmark. We improve this by fine-tuning BLIP-2 with a lightweight Q-Former that fuses visual and textual features into a single embedding, raising Recall@10 to 45.6% (shirt), 40.1% (dress), and 50.4% (top-tee) and increasing the average Recall@50 to 67.6%. We also examine Retrieval-DPO, which fine-tunes CLIP's text encoder with a Direct Preference Optimization loss applied to FAISS-mined hard negatives. Despite extensive tuning of the scaling factor, index, and sampling strategy, Retrieval-DPO attains only 0.02% Recall@10 -- far below zero-shot and prompt-tuned baselines -- because it (i) lacks joint image-text fusion, (ii) uses a margin objective misaligned with top-K metrics, (iii) relies on low-quality negatives, and (iv) keeps the vision and Transformer layers frozen. Our results show that effective preference-based CIR requires genuine multimodal fusion, ranking-aware objectives, and carefully curated negatives.
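The two pipelines described in the abstract can be made concrete with short sketches. First, fused-embedding retrieval: the reference image and the edit text are collapsed into one query vector, and Recall@K comes from a nearest-neighbor search over the gallery. This is a minimal illustration, not the paper's code; `fuse_ref_and_edit` is a hypothetical stand-in for the fine-tuned BLIP-2 Q-Former, while the FAISS calls are the library's actual API.

```python
import numpy as np
import faiss
import torch

def fuse_ref_and_edit(image: torch.Tensor, edit_text: str) -> np.ndarray:
    """Hypothetical stand-in for the fine-tuned BLIP-2 Q-Former: maps
    (reference image, edit text) to a single joint embedding."""
    raise NotImplementedError

def build_gallery_index(gallery_embs: np.ndarray) -> faiss.Index:
    # Cosine similarity = inner product over L2-normalized float32 vectors.
    embs = np.ascontiguousarray(gallery_embs, dtype="float32")
    faiss.normalize_L2(embs)
    index = faiss.IndexFlatIP(embs.shape[1])
    index.add(embs)
    return index

def retrieve_top_k(index: faiss.Index, ref_image: torch.Tensor,
                   edit_text: str, k: int = 10):
    query = fuse_ref_and_edit(ref_image, edit_text)
    query = np.ascontiguousarray(query[None, :], dtype="float32")
    faiss.normalize_L2(query)
    scores, ids = index.search(query, k)  # ids feed Recall@10 / Recall@50
    return ids[0], scores[0]
```

Second, the Retrieval-DPO objective. The abstract does not spell out the loss, so the following is an assumed adaptation of the standard DPO formulation to retrieval, with similarity scores standing in for log-probabilities: the target image is the preferred item, a FAISS-mined hard negative is the rejected item, and `beta` is the scaling factor the authors report tuning.

```python
import torch
import torch.nn.functional as F

def retrieval_dpo_loss(sim_pos: torch.Tensor, sim_neg: torch.Tensor,
                       sim_pos_ref: torch.Tensor, sim_neg_ref: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    """sim_* come from the tuned text encoder, sim_*_ref from the frozen
    reference encoder; all tensors have shape (batch,)."""
    # Preferred (target image) vs. rejected (hard negative), each measured
    # relative to the frozen reference model, mirroring DPO's structure.
    margin = (sim_pos - sim_pos_ref) - (sim_neg - sim_neg_ref)
    return -F.logsigmoid(beta * margin).mean()
```

Note that this loss rewards a pairwise margin between one positive and one negative; nothing ties it to where the target lands in the full top-K ranking, which is failure mode (ii) cited in the abstract.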
Related papers
- MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval [32.33545237942899]
Composed Image Retrieval (CIR) is the task of retrieving a target image from a gallery using a reference image and a modification text. We propose Multi-Faceted Chain-of-Thought with Re-Ranking (MCoT-RE) as a training-free zero-shot CIR framework.
arXiv Detail & Related papers (2025-07-17T06:22:49Z) - QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval [24.699637275626998]
Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. A key limitation arises because most methods employing contrastive learning treat the target image as positive and all other images in the batch as negatives, introducing false negatives (sketched after this list). We propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives.
arXiv Detail & Related papers (2025-07-16T17:06:33Z) - Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval [52.709090256954276]
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query. We propose a novel framework employing a Multimodal Reasoning Agent (MRA) for ZS-CIR.
arXiv Detail & Related papers (2025-05-26T13:17:50Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval [28.018754406453937]
Composed Image Retrieval (CIR) aims to retrieve target images that closely resemble a reference image while reflecting user-specified textual modifications. We present One-Stage Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR). OSrCIR achieves performance gains of 1.80% to 6.44% over existing training-free methods across multiple tasks.
arXiv Detail & Related papers (2024-12-15T06:22:20Z) - An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval [34.065449743428005]
Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. Traditional Zero-Shot (ZS) CIR methods bypass the need for expensive training on CIR triplets by projecting image embeddings into the text token embedding space (sketched after this list). We introduce Reducing Task Discrepancy of Texts (RTD), an efficient text-only framework that complements projection-based CIR methods.
arXiv Detail & Related papers (2024-06-13T14:49:28Z) - Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z) - Turning a CLIP Model into a Scene Text Spotter [73.63953542526917]
We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks.
The resulting backbone, FastTCM-CR50, utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge.
It introduces an instance-language matching process to enhance the synergy between image and text embeddings.
arXiv Detail & Related papers (2023-08-21T01:25:48Z) - Expressive Losses for Verified Robustness via Convex Combinations [67.54357965665676]
We study the relationship between the over-approximation coefficient and performance profiles across different expressive losses.
We show that, while expressivity is essential, better approximations of the worst-case loss are not necessarily linked to superior robustness-accuracy trade-offs.
arXiv Detail & Related papers (2023-05-23T12:20:29Z) - Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z) - Reducing Predictive Feature Suppression in Resource-Constrained Contrastive Image-Caption Retrieval [65.33981533521207]
We introduce an approach to reduce predictive feature suppression for resource-constrained ICR methods: latent target decoding (LTD; sketched after this list).
LTD reconstructs the input caption in a latent space of a general-purpose sentence encoder, which prevents the image and caption encoder from suppressing predictive features.
Our experiments show that, unlike reconstructing the input caption in the input space, LTD reduces predictive feature suppression, measured by obtaining higher recall@k, r-precision, and nDCG scores.
arXiv Detail & Related papers (2022-04-28T09:55:28Z)
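A few of the mechanisms mentioned in the summaries above can be sketched briefly. The in-batch contrastive setup that the QuRe summary criticizes looks as follows; all names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive(query_embs: torch.Tensor,
                         target_embs: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Row i of query_embs and target_embs forms a matched (positive) pair."""
    q = F.normalize(query_embs, dim=-1)
    t = F.normalize(target_embs, dim=-1)
    logits = q @ t.T / temperature                     # (batch, batch) scores
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positives
    # Cross-entropy pushes down every off-diagonal score, so another gallery
    # image that also satisfies the query is punished as a negative; this is
    # the false-negative problem QuRe's hard-negative sampling targets.
    return F.cross_entropy(logits, labels)
```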
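The projection trick in the RTD summary maps a CLIP image embedding into the text token embedding space so that a frozen text encoder can consume it as a pseudo-token. The head below is a hypothetical sketch of that general idea, not RTD itself; dimensions and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class Img2TextToken(nn.Module):
    """Hypothetical projection head: CLIP image embedding -> pseudo-token."""
    def __init__(self, img_dim: int = 768, tok_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(img_dim, tok_dim),
            nn.GELU(),
            nn.Linear(tok_dim, tok_dim),
        )

    def forward(self, img_emb: torch.Tensor) -> torch.Tensor:
        # The returned vector is spliced into a prompt such as
        # "a photo of <pseudo> that <modification text>" before the frozen
        # text encoder runs; only this small head needs training.
        return self.proj(img_emb)
```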
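Finally, latent target decoding from the last entry: rather than reconstructing the caption token by token in the input space, the caption embedding is decoded into the latent space of a frozen general-purpose sentence encoder. The function names and the cosine-distance choice below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ltd_auxiliary_loss(caption_emb: torch.Tensor,
                       decoder: torch.nn.Module,
                       frozen_sentence_emb: torch.Tensor) -> torch.Tensor:
    """Reconstruct the caption in a latent space rather than the input space."""
    reconstructed = decoder(caption_emb)      # learned decoder head
    target = frozen_sentence_emb.detach()     # frozen sentence-encoder output
    return 1.0 - F.cosine_similarity(reconstructed, target, dim=-1).mean()
```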