End-to-end Knowledge Retrieval with Multi-modal Queries
- URL: http://arxiv.org/abs/2306.00424v1
- Date: Thu, 1 Jun 2023 08:04:12 GMT
- Title: End-to-end Knowledge Retrieval with Multi-modal Queries
- Authors: Man Luo, Zhiyuan Fang, Tejas Gokhale, Yezhou Yang, Chitta Baral
- Abstract summary: ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model "ReViz" that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
- Score: 50.01264794081951
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We investigate knowledge retrieval with multi-modal queries, i.e. queries
containing information split across image and text inputs, a challenging task
that differs from previous work on cross-modal retrieval. We curate a new
dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a
system to retrieve knowledge from a large corpus by integrating contents from
both text and image queries. We introduce a retriever model "ReViz" that can
directly process input text and images to retrieve relevant knowledge in an
end-to-end fashion without being dependent on intermediate modules such as
object detectors or caption generators. We introduce a new pretraining task
that is effective for learning knowledge retrieval with multimodal queries and
also improves performance on downstream tasks. We demonstrate superior
performance in retrieval on two datasets (ReMuQ and OK-VQA) under zero-shot
settings as well as further improvements when finetuned on these datasets.
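To make the retrieval setting in the abstract concrete, here is a minimal sketch of ranking a knowledge corpus against a query that combines a text vector and an image vector. It does not reproduce ReViz: the fusion by element-wise mean, the function names `fuse_query` and `retrieve`, and the toy two-dimensional embeddings are all assumptions chosen for illustration, and a real system would obtain the vectors from trained encoders.

```python
from typing import List, Sequence


def fuse_query(text_vec: Sequence[float], image_vec: Sequence[float]) -> List[float]:
    """Hypothetical fusion step: element-wise mean of the two modality vectors.

    This is an illustrative assumption, not the fusion used by ReViz.
    """
    return [(t + i) / 2.0 for t, i in zip(text_vec, image_vec)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: Sequence[float],
             corpus_vecs: Sequence[Sequence[float]],
             k: int) -> List[int]:
    """Return the indices of the k corpus entries with the highest inner-product score."""
    scores = [dot(query_vec, doc) for doc in corpus_vecs]
    ranked = sorted(range(len(corpus_vecs)), key=lambda idx: scores[idx], reverse=True)
    return ranked[:k]


# Toy embeddings (made up for the example): three knowledge entries in 2-D.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_vec = [1.0, 0.0]   # information carried only by the text part of the query
image_vec = [0.0, 1.0]  # information carried only by the image part of the query

query = fuse_query(text_vec, image_vec)
top = retrieve(query, corpus, k=1)
print(top)
```

In this toy setup, the entry that matches both modalities scores higher than entries that match only the text or only the image, which is the behaviour a multi-modal query is meant to reward.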
Related papers
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs).
We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.
We propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers.
arXiv Detail & Related papers (2024-11-04T20:06:34Z)
- RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RoRA-VLM is a novel and robust retrieval augmentation framework specifically tailored for vision-language models.
We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- Query Rewriting for Retrieval-Augmented Large Language Models [139.242907155883]
Large Language Models (LLMs) act as powerful, black-box readers in the retrieve-then-read pipeline.
This work introduces a new framework, Rewrite-Retrieve-Read, which replaces the previous retrieve-then-read pipeline for retrieval-augmented LLMs.
arXiv Detail & Related papers (2023-05-23T17:27:50Z)
- Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering [4.114444605090133]
We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities.
KVQAE is a recently introduced task that consists of answering questions about named entities grounded in a visual context using a Knowledge Base.
Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension.
arXiv Detail & Related papers (2023-01-11T09:16:34Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
The paper addresses image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in the mean of Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets respectively.
arXiv Detail & Related papers (2022-04-24T08:10:06Z)