The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model
- URL: http://arxiv.org/abs/2312.01656v2
- Date: Tue, 5 Dec 2023 02:24:38 GMT
- Title: The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model
- Authors: Yilin Ye, Qian Zhu, Shishi Xiao, Kang Zhang, Wei Zeng
- Abstract summary: We introduce an innovative user intent expansion framework for image search.
Our framework leverages visual-language models to parse and compose multi-modal user inputs.
The proposed framework significantly improves users' image search experience.
- Score: 4.531548217880843
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image search is an essential and user-friendly method to explore vast
galleries of digital images. However, existing image search methods heavily
rely on proximity measurements like tag matching or image similarity, requiring
precise user inputs for satisfactory results. To meet the growing demand for a
contemporary image search engine that enables accurate comprehension of users'
search intentions, we introduce an innovative user intent expansion framework.
Our framework leverages visual-language models to parse and compose multi-modal
user inputs to provide more accurate and satisfying results. It comprises
a two-stage process: 1) a parsing stage that incorporates a language parsing
module with large language models to enhance the comprehension of textual
inputs, along with a visual parsing module that integrates an interactive
segmentation module to swiftly identify detailed visual elements within images;
and 2) a logic composition stage that combines multiple user search intents
into a unified logic expression for more sophisticated operations in complex
searching scenarios. Moreover, the intent expansion framework enables users to
perform flexible contextualized interactions with the search results to further
specify or adjust their detailed search intents iteratively. We implemented the
framework into an image search system for NFT (non-fungible token) search and
conducted a user study to evaluate its usability and novel properties. The
results indicate that the proposed framework significantly improves users'
image search experience. In particular, the parsing and contextualized
interactions prove useful in allowing users to express their search intents
more accurately and engage in a more enjoyable iterative search experience.
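The abstract does not spell out the implementation, but the logic composition stage lends itself to a compact illustration. The following Python sketch is a hypothetical rendering of how parsed multi-modal intents might be combined with AND/OR/NOT into a single expression and used to rank a gallery; the class names, the min/max fuzzy-logic semantics, and the dummy similarity scores are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical leaf intent: a description (from the language or visual parsing
# stage) plus a scoring function, e.g. a vision-language similarity in [0, 1].
@dataclass
class Intent:
    description: str
    score: Callable[[str], float]  # image_id -> relevance

# Logic composition stage (illustrative fuzzy semantics: AND=min, OR=max, NOT=1-x).
class And:
    def __init__(self, *children): self.children = children
    def __call__(self, image_id: str) -> float:
        return min(child(image_id) for child in self.children)

class Or:
    def __init__(self, *children): self.children = children
    def __call__(self, image_id: str) -> float:
        return max(child(image_id) for child in self.children)

class Not:
    def __init__(self, child): self.child = child
    def __call__(self, image_id: str) -> float:
        return 1.0 - self.child(image_id)

def rank(expression: Callable[[str], float], gallery: List[str]) -> List[str]:
    """Order the gallery by how well each image satisfies the composed intent."""
    return sorted(gallery, key=expression, reverse=True)

if __name__ == "__main__":
    # Dummy per-intent similarities standing in for vision-language model outputs.
    sims: Dict[str, Dict[str, float]] = {
        "nft_001": {"sunglasses": 0.9, "hat": 0.1},
        "nft_002": {"sunglasses": 0.8, "hat": 0.7},
    }
    sunglasses = Intent("wearing sunglasses", lambda i: sims[i]["sunglasses"])
    hat = Intent("wearing a hat", lambda i: sims[i]["hat"])
    # "Sunglasses but no hat" expressed as a unified logic expression.
    query = And(sunglasses.score, Not(hat.score))
    print(rank(query, ["nft_001", "nft_002"]))  # nft_001 ranks first
```

In the full system described by the abstract, the leaf scores would come from the parsing stage (textual intents via the language parsing module, visual intents via interactive segmentation), and the composed expression would drive retrieval over the NFT gallery.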
Related papers
- Leveraging Large Language Models for Multimodal Search [0.6249768559720121]
This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset.
We also propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction.
arXiv Detail & Related papers (2024-04-24T10:30:42Z)
- You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models; a minimal CLIP-based retrieval sketch appears after this list.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z)
- Large Language Models for Captioning and Retrieving Remote Sensing Images [4.499596985198142]
RS-CapRet is a Vision and Language method for remote sensing tasks.
It can generate descriptions for remote sensing images and retrieve images from textual descriptions.
arXiv Detail & Related papers (2024-02-09T15:31:01Z)
- PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation [16.41459454076984]
This research proposes PromptMagician, a visual analysis system that helps users explore the image results and refine the input prompts.
The backbone of our system is a prompt recommendation model that takes user prompts as input, retrieves similar prompt-image pairs from DiffusionDB, and identifies special (important and relevant) prompt keywords.
arXiv Detail & Related papers (2023-07-18T07:46:25Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of the two query elements, a reference image and a modifying text, into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z)
- Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires users to express not only what content they are looking for but also where in the image it should appear.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z)
- SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval [15.074592583852167]
We focus on the task of text-conditioned image retrieval that utilizes support text feedback alongside a reference image to retrieve images.
We propose a novel framework SAC which resolves the above in two major steps: "where to see" (Semantic Feature Attention) and "how to change".
We show how our architecture streamlines the generation of text-aware image features by removing the need for various modules required by other state-of-art techniques.
arXiv Detail & Related papers (2020-09-03T06:55:23Z)
- Sequential Gallery for Interactive Visual Design Optimization [51.52002870143971]
We propose a novel user-in-the-loop optimization method that allows users to efficiently find an appropriate parameter set.
We also propose using a gallery-based interface that provides options in the two-dimensional subspace arranged in an adaptive grid view.
Our experiment with synthetic functions shows that our sequential plane search can find satisfactory solutions in fewer iterations than baselines.
arXiv Detail & Related papers (2020-05-08T15:24:35Z)
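Several of the entries above, like the similarity-based baselines the main paper contrasts itself with, rest on embedding-space retrieval with a pre-trained vision-language model such as CLIP. The sketch below shows that backbone using the Hugging Face transformers CLIP interface; the checkpoint name and gallery paths are placeholders, and the snippet is illustrative rather than a reproduction of any listed system.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP variant exposes the same interface.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def embed_text(query: str) -> torch.Tensor:
    # Encode the text query and L2-normalize so dot products are cosine similarities.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1)

def embed_images(paths: list[str]) -> torch.Tensor:
    # Encode gallery images with the same joint embedding space.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(features, dim=-1)

def retrieve(query: str, gallery_paths: list[str], top_k: int = 5) -> list[str]:
    # Cosine similarity between the text query and every gallery image.
    scores = embed_images(gallery_paths) @ embed_text(query).squeeze(0)
    best = scores.topk(min(top_k, len(gallery_paths))).indices.tolist()
    return [gallery_paths[i] for i in best]

# Example usage (placeholder paths):
# print(retrieve("an ape wearing sunglasses", ["gallery/a.png", "gallery/b.png"]))
```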