Related papers: CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning

CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning

URL: http://arxiv.org/abs/2510.08003v1
Date: Thu, 09 Oct 2025 09:41:45 GMT
Title: CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning
Authors: Weihuang Lin, Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji,
Abstract summary: Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text.<n>CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
Score: 93.05917922306196
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Composed Image Retrieval (CIR), which aims to find a target image from a reference image and a modification text, presents the core challenge of performing unified reasoning across visual and semantic modalities. While current approaches based on Vision-Language Models (VLMs, e.g., CLIP) and more recent Multimodal Large Language Models (MLLMs, e.g., Qwen-VL) have shown progress, they predominantly function as ``black boxes." This inherent opacity not only prevents users from understanding the retrieval rationale but also restricts the models' ability to follow complex, fine-grained instructions. To overcome these limitations, we introduce CIR-CoT, the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning. By compelling the model to first generate an interpretable reasoning chain, CIR-CoT enhances its ability to capture crucial cross-modal interactions, leading to more accurate retrieval while making its decision process transparent. Since existing datasets like FashionIQ and CIRR lack the necessary reasoning data, a key contribution of our work is the creation of structured CoT annotations using a three-stage process involving a caption, reasoning, and conclusion. Our model is then fine-tuned to produce this structured output before encoding its final retrieval intent into a dedicated embedding. Comprehensive experiments show that CIR-CoT achieves highly competitive performance on in-domain datasets (FashionIQ, CIRR) and demonstrates remarkable generalization on the out-of-domain CIRCO dataset, establishing a new path toward more effective and trustworthy retrieval systems.

Related papers

Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition [51.68340973140949]
Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions.<n> MLLMs exhibit $textbfmodality bias$, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts.<n>We propose Modality-aware Consistency Reasoning ($bfMCR$), which enforces structured cross-modal reasoning.
arXiv Detail & Related papers (2026-02-04T12:12:49Z)
Generative Editing in the Joint Vision-Language Space for Zero-Shot Composed Image Retrieval [11.724675700368316]
Composed Image Retrieval (CIR) enables fine-grained visual search by combining a reference image with a textual modification.<n>We propose Fusion-Diff, a novel generative editing framework with high effectiveness and data efficiency designed for multimodal alignment.
arXiv Detail & Related papers (2025-12-01T13:04:55Z)
Vis-CoT: A Human-in-the-Loop Framework for Interactive Visualization and Intervention in LLM Chain-of-Thought Reasoning [0.13192560874022083]
We present Vis-CoT, a human-in-the-loop framework that converts linear chain-of-thought text into an interactive reasoning graph.<n>Users can visualize the logical flow, identify flawed steps, and intervene by pruning incorrect paths and grafting new, user-defined premises.<n> Vis-CoT improves final-answer accuracy by up to 24 percentage points over non-interactive baselines.
arXiv Detail & Related papers (2025-09-01T12:09:43Z)
ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning [57.767536707234036]
We propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT.<n>Specifically, we first adopt the vision encoder EVA-CLIP to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt.<n>A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously.
arXiv Detail & Related papers (2025-07-02T23:41:31Z)
CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query.<n>We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs.<n>We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z)
CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval [13.59418209417664]
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query without training samples.<n>We propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-thought (CoT) and Multi-scale Reasoning.
arXiv Detail & Related papers (2025-02-28T08:12:23Z)
Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints [15.541287957548771]
We propose a Coarse-to-fine Consistency Constraints Visual Grounding architecture.<n>It integrates implicit and explicit modeling approaches within a two-stage framework.<n>It significantly outperforms state-of-the-art REC and RIS methods by a substantial margin.
arXiv Detail & Related papers (2025-01-12T04:30:13Z)
Compositional Image Retrieval via Instruction-Aware Contrastive Learning [40.54022628032561]
Composed Image Retrieval (CIR) involves retrieving a target image based on a composed query of an image paired with text that specifies modifications or changes to the visual reference.<n>In practice, due to the scarcity of annotated data in downstream tasks, Zero-Shot CIR (ZS-CIR) is desirable.<n>We propose a novel embedding method utilizing an instruction-tuned Multimodal LLM (MLLM) to generate composed representation.
arXiv Detail & Related papers (2024-12-07T22:46:52Z)
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled. First, we quantize and embed both text and visual prompt into a unified representational space. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR) It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.