Chatting Makes Perfect: Chat-based Image Retrieval
- URL: http://arxiv.org/abs/2305.20062v2
- Date: Thu, 5 Oct 2023 16:40:02 GMT
- Title: Chatting Makes Perfect: Chat-based Image Retrieval
- Authors: Matan Levy, Rami Ben-Ari, Nir Darshan, Dani Lischinski
- Abstract summary: ChatIR is a chat-based image retrieval system that engages in a conversation with the user to elicit information.
Large Language Models are used to generate follow-up questions to an initial image description.
Our system is capable of retrieving the target image from a pool of 50K images with over 78% success rate after 5 dialogue rounds.
- Score: 25.452015862927766
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Chats emerge as an effective user-friendly approach for information
retrieval, and are successfully employed in many domains, such as customer
service, healthcare, and finance. However, existing image retrieval approaches
typically address the case of a single query-to-image round, and the use of
chats for image retrieval has been mostly overlooked. In this work, we
introduce ChatIR: a chat-based image retrieval system that engages in a
conversation with the user to elicit information, in addition to an initial
query, in order to clarify the user's search intent. Motivated by the
capabilities of today's foundation models, we leverage Large Language Models to
generate follow-up questions to an initial image description. These questions
form a dialog with the user in order to retrieve the desired image from a large
corpus. In this study, we explore the capabilities of such a system tested on a
large dataset and reveal that engaging in a dialog yields significant gains in
image retrieval. We start by building an evaluation pipeline from an existing
manually generated dataset and explore different modules and training
strategies for ChatIR. Our comparison includes strong baselines derived from
related applications trained with Reinforcement Learning. Our system is capable
of retrieving the target image from a pool of 50K images with over 78% success
rate after 5 dialogue rounds, compared to 75% when questions are asked by
humans, and 64% for single-shot text-to-image retrieval. Extensive
evaluations reveal the strong capabilities and examine the limitations of
ChatIR under different settings. The project repository is available at
https://github.com/levymsn/ChatIR.
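The abstract outlines the pipeline only at a high level: a retriever ranks the corpus against the dialogue accumulated so far, while an LLM asks follow-up questions between retrieval rounds. The snippet below is a minimal sketch of how such a loop could be wired together; `embed_text`, `image_embeddings`, `ask_followup`, and `answer_question` are hypothetical placeholders, and this is an illustration of the idea rather than the authors' implementation.
```python
# A minimal sketch of a ChatIR-style dialogue retrieval loop (illustrative only).
# Assumptions (not from the paper's code): a CLIP/BLIP-style dual encoder exposed as
# `embed_text`, an L2-normalized matrix `image_embeddings` of shape (num_images, dim),
# an LLM wrapper `ask_followup`, and `answer_question` standing in for the user
# (or a VQA model during automatic evaluation).
import numpy as np

def retrieve(dialogue, image_embeddings, embed_text):
    """Rank the image corpus by cosine similarity to the concatenated dialogue."""
    query = embed_text(", ".join(dialogue))        # one text embedding for the whole dialogue so far
    query = query / np.linalg.norm(query)
    scores = image_embeddings @ query              # cosine similarity (corpus embeddings pre-normalized)
    return np.argsort(-scores)                     # best-matching image indices first

def chat_retrieval(caption, image_embeddings, target_idx, embed_text,
                   ask_followup, answer_question, rounds=5, top_k=10):
    """Run up to `rounds` question-answer turns; return the round at which the
    target image first enters the top-k ranking, or None if it never does."""
    dialogue = [caption]                           # round 0: initial image description only
    for t in range(rounds + 1):
        ranking = retrieve(dialogue, image_embeddings, embed_text)
        if target_idx in ranking[:top_k]:
            return t                               # success after t dialogue rounds
        if t < rounds:
            question = ask_followup(dialogue)      # LLM proposes the next clarifying question
            dialogue += [question, answer_question(question)]
    return None
```
Under this formulation, the reported success rates can be read as the fraction of targets that surface within the top ranks by a given round; the exact hit criterion used in the paper may differ from this top-k stand-in.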
Related papers
- ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval [31.663016521987764]
We investigate the task of general conversational image retrieval on open-domain images.
To advance this task, we curate a dataset called ChatSearch.
This dataset includes a multi-round multimodal conversational context query for each target image.
We propose a generative retrieval model named ChatSearcher, which is trained end-to-end to accept/produce interleaved image-text inputs/outputs.
arXiv Detail & Related papers (2024-10-24T13:19:22Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in a single autoregressive generation process and propose an autonomous decision module to choose the better match between the generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach [33.231639257323536]
In this paper, we address the issue of dialogue-form context queries within the interactive text-to-image retrieval task.
By reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data.
We construct an LLM questioner that generates non-redundant questions about the attributes of the target image.
arXiv Detail & Related papers (2024-06-05T16:09:01Z)
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z)
- GeoChat: Grounded Large Vision-Language Model for Remote Sensing [65.78360056991247]
We propose GeoChat - the first versatile remote sensing Large Vision-Language Model (VLM) that offers multitask conversational capabilities with high-resolution RS images.
Specifically, GeoChat not only answers image-level queries but also accepts region inputs to hold region-specific dialogue.
GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection.
arXiv Detail & Related papers (2023-11-24T18:59:10Z)
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend the capabilities of large language models by integrating both text and images.
Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images.
We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with Partial Query [25.398090300086302]
We propose an interactive retrieval framework called Part2Whole to tackle cross-modal retrieval from partial queries.
An Interactive Retrieval Agent is trained to build an optimal policy to refine the initial query.
We present a weakly-supervised reinforcement learning method that needs no human-annotated data other than the text-image dataset.
arXiv Detail & Related papers (2021-03-02T11:27:05Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments show that the proposed method significantly improves dialogue quality by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)