Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration
- URL: http://arxiv.org/abs/2601.14060v1
- Date: Tue, 20 Jan 2026 15:17:14 GMT
- Title: Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration
- Authors: Yongcong Ye, Kai Zhang, Yanghai Zhang, Enhong Chen, Longfei Li, Jun Zhou
- Abstract summary: Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
- Score: 64.12127577975696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to convert the image into a pseudo token, combining it with the modification text and the objects most likely to be added. (2) Semantic Information Extraction, which uses a pre-trained captioning model to generate multiple captions for the reference image and then leverages an LLM to generate the modified captions and the objects most likely to be added. (3) Complementary Information Retrieval, which integrates information extracted from both the query and database images to retrieve the target image, enabling the system to efficiently handle retrieval queries in a variety of situations. Extensive experiments on three public datasets (CIRR, CIRCO, and FashionIQ) demonstrate that CVSI significantly outperforms existing state-of-the-art methods. Our code is available at https://github.com/yyc6631/CVSI.
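The abstract's third component, Complementary Information Retrieval, combines visual and semantic evidence when ranking database images. The abstract does not state the fusion rule, so the sketch below is only an illustration under stated assumptions: each query and each database image is taken to have one visual embedding and one semantic embedding, the two cosine similarities are merged by a weighted sum, and the names `rank_candidates` and `alpha` are hypothetical rather than taken from the CVSI code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_visual, query_semantic, candidates, alpha=0.5):
    """Rank database images by a weighted sum of two similarities.

    candidates maps an image id to (visual_embedding, semantic_embedding).
    alpha is an assumed weight on the visual branch; the weighted sum is
    an illustrative fusion rule, not the rule used in the paper.
    """
    scores = {}
    for image_id, (cand_visual, cand_semantic) in candidates.items():
        visual_score = cosine(query_visual, cand_visual)
        semantic_score = cosine(query_semantic, cand_semantic)
        scores[image_id] = alpha * visual_score + (1 - alpha) * semantic_score
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-D embeddings, purely illustrative.
candidates = {
    "img_a": ([1.0, 0.0], [0.0, 1.0]),  # matches the query visually only
    "img_b": ([0.0, 1.0], [1.0, 0.0]),  # matches the query semantically only
    "img_c": ([1.0, 1.0], [1.0, 1.0]),  # partially matches on both branches
}
ranking = rank_candidates([1.0, 0.0], [1.0, 0.0], candidates, alpha=0.5)
print(ranking[0])
```

In this toy setup the image that partially matches on both branches ranks first when the two branches are weighted equally, while setting `alpha=1.0` reduces the ranking to visual similarity alone. In a real pipeline the embeddings would come from a vision-language encoder such as CLIP, which is again an assumption about the setup rather than a detail confirmed by the abstract.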
Related papers
- Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data [39.17652541259225]
Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text. We introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets.
arXiv Detail & Related papers (2025-04-01T14:03:46Z)
- Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval [60.20835288280572]
We propose a novel Fine-grained Textual Inversion Network for ZS-CIR, named FTI4CIR. FTI4CIR comprises two main components: fine-grained pseudo-word token mapping and tri-wise caption-based semantic regularization.
arXiv Detail & Related papers (2025-03-25T02:51:25Z)
- Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space. We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set).
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
- FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings. Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Training-free Zero-shot Composed Image Retrieval via Weighted Modality Fusion and Similarity [2.724141845301679]
Composed image retrieval (CIR) formulates the query as a combination of a reference image and modified text.
We introduce a training-free approach for ZS-CIR.
Our approach is simple, easy to implement, and its effectiveness is validated through experiments on the FashionIQ and CIRR datasets.
arXiv Detail & Related papers (2024-09-07T21:52:58Z)
- Training-free Zero-shot Composed Image Retrieval with Local Concept Reranking [34.31345844296072]
Composed image retrieval attempts to retrieve an image of interest from gallery images through a composed query of a reference image and its corresponding modified text.
Most current composed image retrieval methods follow a supervised learning approach to training on a costly triplet dataset composed of a reference image, modified text, and a corresponding target image.
We present a new training-free zero-shot composed image retrieval method which translates the query into explicit human-understandable text.
arXiv Detail & Related papers (2023-12-14T13:31:01Z)
- Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image in a database.
Recent research sidesteps this need by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
arXiv Detail & Related papers (2023-10-13T17:59:38Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)