SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval
- URL: http://arxiv.org/abs/2509.26012v1
- Date: Tue, 30 Sep 2025 09:41:52 GMT
- Title: SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval
- Authors: Yuqi Xiao, Yingying Zhu
- Abstract summary: Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without relying on costly triplet annotations. Existing CLIP-based methods face two core challenges: (1) union-based feature fusion indiscriminately aggregates all visual cues, carrying over irrelevant background details that dilute the intended modification, and (2) global cosine similarity from CLIP embeddings lacks the ability to resolve fine-grained semantic relations.
- Score: 4.230223288110963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image given a reference image and a relative text, without relying on costly triplet annotations. Existing CLIP-based methods face two core challenges: (1) union-based feature fusion indiscriminately aggregates all visual cues, carrying over irrelevant background details that dilute the intended modification, and (2) global cosine similarity from CLIP embeddings lacks the ability to resolve fine-grained semantic relations. To address these issues, we propose SETR (Semantic-enhanced Two-Stage Retrieval). In the coarse retrieval stage, SETR introduces an intersection-driven strategy that retains only the overlapping semantics between the reference image and relative text, thereby filtering out distractors inherent to union-based fusion and producing a cleaner, high-precision candidate set. In the fine-grained re-ranking stage, we adapt a pretrained multimodal LLM with Low-Rank Adaptation to conduct binary semantic relevance judgments ("Yes/No"), which goes beyond CLIP's global feature matching by explicitly verifying relational and attribute-level consistency. Together, these two stages form a complementary pipeline: coarse retrieval narrows the candidate pool with high recall, while re-ranking ensures precise alignment with nuanced textual modifications. Experiments on CIRR, Fashion-IQ, and CIRCO show that SETR achieves new state-of-the-art performance, improving Recall@1 on CIRR by up to 15.15 points. Our results establish two-stage reasoning as a general paradigm for robust and portable ZS-CIR.
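The two stages above compose as a simple retrieve-then-verify loop. Below is a minimal, illustrative Python sketch of that flow; it is not the authors' released code. In particular, the element-wise minimum standing in for the intersection-driven fusion, and the `judge` callable (meant to model a LoRA-adapted multimodal LLM returning the probability of answering "Yes") are assumptions made for illustration.

```python
# Minimal sketch of a two-stage ZS-CIR pipeline in the spirit of SETR.
# NOTE: the intersection heuristic (element-wise minimum) and the
# `judge` callable are illustrative assumptions, not the paper's code.
import numpy as np

def _unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def coarse_retrieval(ref_emb: np.ndarray,
                     text_emb: np.ndarray,
                     gallery_embs: np.ndarray,
                     k: int = 50) -> np.ndarray:
    """Stage 1: keep only semantics shared by the reference image and
    the relative text (modeled here as an element-wise minimum of the
    two unit embeddings), then rank the gallery by cosine similarity."""
    query = _unit(np.minimum(_unit(ref_emb), _unit(text_emb)))
    scores = _unit(gallery_embs) @ query        # cosine vs. every gallery item
    return np.argsort(-scores)[:k]              # high-recall candidate pool

def rerank(candidate_ids, ref_image, relative_text, gallery_images, judge):
    """Stage 2: ask a (LoRA-adapted) multimodal LLM a binary relevance
    question per candidate and sort by its probability of answering Yes."""
    prompt = ("Given the reference image and a candidate image, does the "
              f"candidate satisfy the modification '{relative_text}'? "
              "Answer Yes or No.")
    scored = [(i, judge(ref_image, gallery_images[i], prompt))
              for i in candidate_ids]
    return [i for i, _ in sorted(scored, key=lambda t: -t[1])]
```

Structuring the search this way runs the expensive MLLM pass over only the k coarse candidates rather than the full gallery, which is what lets the high-recall first stage and the high-precision second stage complement each other.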
Related papers
- Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval [3.1241290518951197]
Large-scale e-commerce search demands embeddings that generalize to long-tail, noisy queries. We propose a two-stage "Mine and Refine" contrastive training framework for semantic text embeddings.
arXiv Detail & Related papers (2026-02-19T18:56:36Z) - ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval [64.14282916266998]
Composed Image Retrieval aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. We propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline. Experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-02T04:52:54Z) - Soft Filtering: Guiding Zero-shot Composed Image Retrieval with Prescriptive and Proscriptive Constraints [3.5491867489872413]
Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent, expressed through a reference image and a modification text. Current CIR benchmarks assume a single correct target per query, overlooking the ambiguity in modification texts. We propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR.
arXiv Detail & Related papers (2025-12-23T21:29:45Z) - FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval [36.03123811283016]
We propose FAR-Net, a multi-stage fusion framework designed with enhanced semantic alignment and adaptive reconciliation.<n>Experiments on CIRR and FashionIQ show consistent performance gains, improving Recall@1 by up to 2.4% and Recall@50 by 1.04% over existing methods.
arXiv Detail & Related papers (2025-07-17T06:30:41Z) - CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval [13.59418209417664]
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query without training samples. We propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-Thought (CoT) and Multi-scale Reasoning.
arXiv Detail & Related papers (2025-02-28T08:12:23Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-step working strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - Noisy-Correspondence Learning for Text-to-Image Person Re-identification [50.07634676709067]
We propose a novel Robust Dual Embedding method (RDE) to learn robust visual-semantic associations even with noisy correspondences.
Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on three datasets.
arXiv Detail & Related papers (2023-08-19T05:34:13Z) - Plug-and-Play Regulators for Image-Text Matching [76.28522712930668]
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching.
We develop two simple yet effective regulators that efficiently encode message outputs to automatically contextualize and aggregate cross-modal representations.
Experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models.
arXiv Detail & Related papers (2023-03-23T15:42:05Z) - Learning Self-Supervised Low-Rank Network for Single-Stage Weakly and Semi-Supervised Semantic Segmentation [119.009033745244]
This paper presents a Self-supervised Low-Rank Network (SLRNet) for single-stage weakly supervised semantic segmentation (WSSS) and semi-supervised semantic segmentation (SSSS).
SLRNet uses cross-view self-supervision, that is, it simultaneously predicts several attentive LR representations from different views of an image to learn precise pseudo-labels.
Experiments on the Pascal VOC 2012, COCO, and L2ID datasets demonstrate that our SLRNet outperforms both state-of-the-art WSSS and SSSS methods across a variety of settings.
arXiv Detail & Related papers (2022-03-19T09:19:55Z) - Robust Reference-based Super-Resolution via C2-Matching [77.51610726936657]
Reference-based Super-Resolution (Ref-SR) has recently emerged as a promising paradigm to enhance a low-resolution (LR) input image by introducing an additional high-resolution (HR) reference image.
Existing Ref-SR methods mostly rely on implicit correspondence matching to borrow HR textures from reference images to compensate for the information loss in input images.
We propose C2-Matching, which produces explicit, robust matching across transformations and resolutions.
arXiv Detail & Related papers (2021-06-03T16:40:36Z)