Related papers: DIR-TIR: Dialog-Iterative Refinement for Text-to-Image Retrieval

URL: http://arxiv.org/abs/2511.14449v1
Date: Tue, 18 Nov 2025 12:45:10 GMT
Title: DIR-TIR: Dialog-Iterative Refinement for Text-to-Image Retrieval
Authors: Zongwei Zhen, Biqing Zeng
Abstract summary: Our framework progressively refines the target image search through two specialized modules. The Dialog Refiner actively queries users to extract essential information and generate increasingly precise descriptions. The Image Refiner identifies gaps between generated images and user intentions, strategically reducing the visual-semantic discrepancy.
Score: 3.5092739016434567
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper addresses the task of interactive, conversational text-to-image retrieval. Our DIR-TIR framework progressively refines the target image search through two specialized modules: the Dialog Refiner Module and the Image Refiner Module. The Dialog Refiner actively queries users to extract essential information and generate increasingly precise descriptions of the target image. Complementarily, the Image Refiner identifies perceptual gaps between generated images and user intentions, strategically reducing the visual-semantic discrepancy. By leveraging multi-turn dialogues, DIR-TIR provides superior controllability and fault tolerance compared to conventional single-query methods, significantly improving target image hit accuracy. Comprehensive experiments across diverse image datasets demonstrate our dialogue-based approach substantially outperforms initial-description-only baselines, while the synergistic module integration achieves both higher retrieval precision and enhanced interactive experience.
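The multi-turn idea in the abstract can be illustrated with a minimal sketch. This is not the DIR-TIR implementation: the bag-of-words `embed` function, the `GALLERY` captions, and the pre-written user answers are all hypothetical stand-ins for a real encoder, image gallery, and question-asking Dialog Refiner, and the Image Refiner (which compares generated images against user intent) is not modeled at all. The sketch only shows how accumulating dialog answers into the description can move the target image up the ranking compared with the initial description alone.

```python
from collections import Counter
import math


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real text or image encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query, gallery):
    """Return gallery image names sorted from most to least similar to the query."""
    q = embed(query)
    return sorted(
        gallery, key=lambda name: cosine(q, embed(gallery[name])), reverse=True
    )


def dialog_retrieve(initial_description, answers, gallery):
    """Append each user answer to the running description and re-rank after every turn."""
    description = initial_description
    history = [rank(description, gallery)]
    for answer in answers:
        description = f"{description} {answer}"
        history.append(rank(description, gallery))
    return history


# Hypothetical gallery: each image is represented by a caption for simplicity.
GALLERY = {
    "img_a": "a dog running on a beach",
    "img_b": "a dog sleeping on a red sofa",
    "img_c": "a cat sleeping on a red sofa",
}

# The user is looking for img_b and answers two clarifying questions.
history = dialog_retrieve("a dog", ["sleeping", "red sofa"], GALLERY)
target_ranks = [ranking.index("img_b") + 1 for ranking in history]
print(target_ranks)  # rank of the target after the initial query and after each turn
```

In this toy setup the initial description is ambiguous between two dog images, and the added answers are what separate the target from the rest of the gallery. A real system would generate the questions itself and use learned embeddings rather than word counts.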

Related papers

Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z)
DetailFusion: A Dual-branch Framework with Detail Enhancement for Composed Image Retrieval [51.30915462824879]
Composed Image Retrieval (CIR) aims to retrieve target images from a gallery based on a reference image and modification text as a combined query. Recent approaches focus on balancing global information from two modalities and encode the query into a unified feature for retrieval. We propose DetailFusion, a novel dual-branch framework that effectively coordinates information across global and detailed granularities.
arXiv Detail & Related papers (2025-05-23T12:15:23Z)
RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models [24.67117013862316]
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding. We introduce a referring remote sensing image segmentation foundational model, RSRefSeg. Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods.
arXiv Detail & Related papers (2025-01-12T13:22:35Z)
BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation [21.052101309555464]
Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both. Previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach. We propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content.
arXiv Detail & Related papers (2024-08-12T05:22:42Z)
Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models [25.070424546200293]
We present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors. Experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and accurate visual descriptors. Our findings demonstrate the method's generalizability across diverse visual cues, various LLMs, and different datasets.
arXiv Detail & Related papers (2024-07-04T03:50:30Z)
Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach [33.231639257323536]
In this paper, we address the issue of dialogue-form context query within the interactive text-to-image retrieval task. By reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data. We construct the LLM questioner to generate non-redundant questions about the attributes of the target image.
arXiv Detail & Related papers (2024-06-05T16:09:01Z)
Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption. We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts. Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning [110.7118381246156]
Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason consistency between visual increment in images and semantic increment in instructions. First, we introduce the word-level and instruction-level instruction encoders to learn user's intention from history-correlated instructions as semantic increment. Second, we embed the representation of semantic increment into that of source image for generating target image, where source image plays the role of referring auxiliary.
arXiv Detail & Related papers (2022-04-02T07:48:39Z)
Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression. In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net). Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level. Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with Partial Query [25.398090300086302]
We propose an interactive retrieval framework called Part2Whole to tackle this problem. An Interactive Retrieval Agent is trained to build an optimal policy to refine the initial query. We present a weakly-supervised reinforcement learning method that needs no human-annotated data other than the text-image dataset.
arXiv Detail & Related papers (2021-03-02T11:27:05Z)
Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities. We propose a Cross-Modal Progressive (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.