Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with
Partial Query
- URL: http://arxiv.org/abs/2103.01654v1
- Date: Tue, 2 Mar 2021 11:27:05 GMT
- Title: Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with
Partial Query
- Authors: Guanyu Cai, Xinyang Jiang, Jun Zhang, Yifei Gong, Lianghua He, Pai
Peng, Xiaowei Guo, Xing Sun
- Abstract summary: We propose an interactive retrieval framework called Part2Whole to tackle this problem.
An Interactive Retrieval Agent is trained to build an optimal policy to refine the initial query.
We present a weakly-supervised reinforcement learning method that needs no human-annotated data other than the text-image dataset.
- Score: 25.398090300086302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based image retrieval has seen considerable progress in recent years.
However, the performance of existing methods degrades in real-world use, since
the user is likely to provide an incomplete description of a complex scene, which
often leads to results filled with false positives that fit the incomplete
description. In this work, we introduce the partial-query problem and
extensively analyze its influence on text-based image retrieval. We then
propose an interactive retrieval framework called Part2Whole to tackle this
problem by iteratively enriching the missing details. Specifically, an
Interactive Retrieval Agent is trained to build an optimal policy to refine the
initial query based on a user-friendly interaction and statistical
characteristics of the gallery. Compared to other dialog-based methods that
rely heavily on the user to feed back differentiating information, we let the AI
take over the search for optimal feedback and prompt the user with
confirmation-based questions about details. Furthermore, since fully-supervised
training is often infeasible due to the difficulty of obtaining human-machine
dialog data, we present a weakly-supervised reinforcement learning method that
needs no human-annotated data other than the text-image dataset. Experiments
show that our framework significantly improves the performance of text-based
image retrieval under complex scenes.
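To make the interaction protocol concrete, the following is a minimal Python sketch of the iterative query-enrichment loop the abstract describes. It is an illustrative assumption, not the authors' implementation: encode_text, propose_detail, and user_confirms are hypothetical stand-ins for the text encoder, the Interactive Retrieval Agent's policy, and the user's yes/no confirmation, respectively.

```python
# Hypothetical sketch of the Part2Whole interaction loop described in the
# abstract. All names here are illustrative assumptions, not the authors' API.
import numpy as np

def cosine_scores(query_vec, gallery_vecs):
    """Score gallery images by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    return g @ q

def retrieve(query_text, gallery_vecs, encode_text, top_k=10):
    """Return indices of the top-k gallery images for a text query."""
    scores = cosine_scores(encode_text(query_text), gallery_vecs)
    return np.argsort(-scores)[:top_k]

def part2whole_loop(initial_query, gallery_vecs, encode_text,
                    propose_detail, user_confirms, max_rounds=5):
    """Iteratively enrich a partial query with user-confirmed details.

    propose_detail(query, candidate_ids): stand-in for the trained agent's
    policy; picks the detail expected to best disambiguate the current
    candidates, using statistical characteristics of the gallery.
    user_confirms(question): yes/no confirmation from the user.
    """
    query = initial_query
    for _ in range(max_rounds):
        candidates = retrieve(query, gallery_vecs, encode_text)
        question = propose_detail(query, candidates)  # agent's action
        if question is None:  # no informative detail left to ask about
            break
        if user_confirms(question):  # confirmation-based interaction
            query = f"{query}, {question}"  # enrich the partial query
    return retrieve(query, gallery_vecs, encode_text)
```

One plausible reading of the weakly-supervised training setup is that user_confirms is replaced by a simulated user that answers "yes" whenever the proposed detail appears in the target image's full ground-truth caption, giving the agent a reward signal without any human-annotated dialog data.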
Related papers
- Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval [26.585985828583304]
We propose an end-to-end multimodal retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries.
To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval dataset automatically constructed from visual dialogue datasets.
We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios.
arXiv Detail & Related papers (2024-11-13T04:32:58Z) - Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs)
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval through autoregressive generation and propose an autonomous decision module to choose the best match between the generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z) - Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach [33.231639257323536]
In this paper, we address the issue of dialogue-form context query within the interactive text-to-image retrieval task.
By reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data.
We construct an LLM-based questioner that generates non-redundant questions about the attributes of the target image.
arXiv Detail & Related papers (2024-06-05T16:09:01Z) - You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z) - LLM Blueprint: Enabling Text-to-Image Generation with Complex and
Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models.
arXiv Detail & Related papers (2023-10-16T17:57:37Z) - Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in mean Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively (see the metric sketch after this list).
arXiv Detail & Related papers (2022-04-24T08:10:06Z) - Where Does the Performance Improvement Come From? - A Reproducibility
Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval.
We first examine the related concerns and why the focus is on image-text retrieval tasks.
We analyze various aspects of the reproduction of pretrained and non-pretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z) - Text-based Person Search in Full Images via Semantic-Driven Proposal
Generation [42.25611020956918]
We propose a new end-to-end learning framework that jointly optimizes pedestrian detection, identification, and visual-semantic feature embedding.
To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to the text-described proposals.
arXiv Detail & Related papers (2021-09-27T11:42:40Z) - Connecting Images through Time and Sources: Introducing Low-data,
Heterogeneous Instance Retrieval [3.6526118822907594]
We show that it is not trivial to select features that respond well to a panel of variations and semantic content.
Introducing a new enhanced version of the Alegoria benchmark, we compare descriptors using the detailed annotations.
arXiv Detail & Related papers (2021-03-19T10:54:51Z) - From A Glance to "Gotcha": Interactive Facial Image Retrieval with
Progressive Relevance Feedback [72.29919762941029]
We propose an end-to-end framework to retrieve facial images with relevance feedback progressively provided by the witness.
With no need for any extra annotations, our model can be applied at the cost of only a little response effort.
arXiv Detail & Related papers (2020-07-30T18:46:25Z)
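As referenced in the Progressive Learning entry above, several of these papers report mean Recall@K. Below is a quick sketch of that metric, with the caveat that the exact cutoffs are an assumption (Fashion-IQ evaluations commonly report Recall@10 and Recall@50):

```python
import numpy as np

def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target image appears in the top-k results."""
    hits = [target in ranking[:k] for ranking, target in zip(rankings, targets)]
    return float(np.mean(hits))

def mean_recall(rankings, targets, ks=(10, 50)):
    """Average of Recall@K over several cutoffs K."""
    return float(np.mean([recall_at_k(rankings, targets, k) for k in ks]))
```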