UniIR: Training and Benchmarking Universal Multimodal Information
Retrievers
- URL: http://arxiv.org/abs/2311.17136v1
- Date: Tue, 28 Nov 2023 18:55:52 GMT
- Title: UniIR: Training and Benchmarking Universal Multimodal Information
Retrievers
- Authors: Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan
Ritter, Wenhu Chen
- Abstract summary: We introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities.
UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks.
We construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
- Score: 76.06249845401975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing information retrieval (IR) models often assume a homogeneous format,
limiting their applicability to diverse user needs, such as searching for
images with text descriptions, searching for a news article with a headline
image, or finding a similar photo with a query image. To approach such
different information-seeking demands, we introduce UniIR, a unified
instruction-guided multimodal retriever capable of handling eight distinct
retrieval tasks across modalities. UniIR, a single retrieval system jointly
trained on ten diverse multimodal-IR datasets, interprets user instructions to
execute various retrieval tasks, demonstrating robust performance across
existing datasets and zero-shot generalization to new tasks. Our experiments
highlight that multi-task training and instruction tuning are key to UniIR's
generalization ability. Additionally, we construct the M-BEIR, a multimodal
retrieval benchmark with comprehensive results, to standardize the evaluation
of universal multimodal information retrieval.
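The abstract does not spell out how instructions, query text, and query images are combined at scoring time, so the following is only a minimal, hypothetical sketch of score-level fusion for an instruction-guided multimodal query. The encoder stubs, feature dimensions, and the simple additive fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch: instruction-guided multimodal retrieval with
# score-level fusion. The encoders are random stand-ins and the additive
# fusion is an illustrative assumption, not UniIR's actual method.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 512                              # placeholder embedding size
text_encoder = nn.Linear(300, DIM)     # stand-in for a text tower
image_encoder = nn.Linear(2048, DIM)   # stand-in for an image tower

def encode_query(instr_feats, q_text_feats, q_image_feats):
    # Treat the task instruction as extra text: here it is simply summed
    # with the query-text embedding as a placeholder for instruction guidance.
    q_txt = F.normalize(text_encoder(instr_feats) + text_encoder(q_text_feats), dim=-1)
    q_img = F.normalize(image_encoder(q_image_feats), dim=-1)
    return q_txt, q_img

def score(q_txt, q_img, c_txt, c_img):
    # Score-level fusion: add cosine similarities over every
    # query-modality / candidate-modality pair.
    return q_txt @ c_txt.T + q_txt @ c_img.T + q_img @ c_txt.T + q_img @ c_img.T

# Toy usage with random features standing in for real text/image inputs.
q_txt, q_img = encode_query(torch.randn(1, 300), torch.randn(1, 300), torch.randn(1, 2048))
c_txt = F.normalize(text_encoder(torch.randn(5, 300)), dim=-1)
c_img = F.normalize(image_encoder(torch.randn(5, 2048)), dim=-1)
ranking = score(q_txt, q_img, c_txt, c_img).argsort(descending=True)
```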
Related papers
- MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs).
We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.
We propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers.
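The summary only names modality-aware hard negative mining; one plausible reading, sketched below, is to restrict hard negatives to candidates in the query's target modality so the retriever cannot reduce its loss by simply preferring one modality. The field names (`modality`, `positive_id`) and the selection rule are assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of modality-aware hard negative mining: among the
# top-scoring candidates, keep only those in the query's target modality
# so training does not reinforce a bias toward one modality.
def mine_hard_negatives(query, candidates, scores, target_modality, k=5):
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    negatives = []
    for i in order:
        cand = candidates[i]
        if cand["id"] == query["positive_id"]:
            continue  # never turn the gold target into a negative
        if cand["modality"] != target_modality:
            continue  # modality-aware filter: drop wrong-modality distractors
        negatives.append(cand)
        if len(negatives) == k:
            break
    return negatives
```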
arXiv Detail & Related papers (2024-11-04T20:06:34Z)
- Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification [62.894790379098005]
We propose a novel instruct-ReID task that requires the model to retrieve images according to the given image or language instructions.
Instruct-ReID is the first exploration of a general ReID setting, where six existing ReID tasks can be viewed as special cases by assigning different instructions.
We propose a novel baseline model, IRM, with an adaptive triplet loss to handle various retrieval tasks within a unified framework.
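The adaptive triplet loss itself is not described in the summary; below is only a plain margin-based triplet loss over instruction-conditioned embeddings, with the adaptive component left out. The fixed margin and the embedding shapes are assumptions.

```python
# Minimal triplet-loss sketch for instruction-conditioned retrieval;
# the margin is fixed here, whereas IRM's loss is described as adaptive.
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    # anchor: embedding of (image, instruction); positive/negative: gallery embeddings
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```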
arXiv Detail & Related papers (2024-05-28T03:35:46Z)
- Decoupling Common and Unique Representations for Multimodal Self-supervised Learning [22.12729786091061]
We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning.
By distinguishing inter- and intra-modal embeddings through multimodal redundancy reduction, DeCUR can integrate complementary information across different modalities.
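As a rough illustration of the summary's "multimodal redundancy reduction", the sketch below splits each modality's embedding into a shared slice and a modality-unique slice and applies a cross-correlation loss only to the shared parts. The dimension split and the Barlow-Twins-style loss form are assumptions, not DeCUR's exact recipe.

```python
# Hypothetical cross-modal redundancy-reduction loss: align matching
# dimensions of the shared ("common") slices and decorrelate the rest.
import torch

def redundancy_reduction_loss(z1, z2, common_dims=128, lam=5e-3):
    c1, c2 = z1[:, :common_dims], z2[:, :common_dims]   # assumed shared subspace
    c1 = (c1 - c1.mean(0)) / (c1.std(0) + 1e-6)
    c2 = (c2 - c2.mean(0)) / (c2.std(0) + 1e-6)
    corr = (c1.T @ c2) / z1.shape[0]                     # cross-correlation matrix
    on_diag = ((torch.diagonal(corr) - 1) ** 2).sum()    # pull matching dims together
    off_diag = (corr ** 2).sum() - (torch.diagonal(corr) ** 2).sum()  # push others apart
    return on_diag + lam * off_diag

loss = redundancy_reduction_loss(torch.randn(32, 256), torch.randn(32, 256))
```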
arXiv Detail & Related papers (2023-09-11T08:35:23Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
The ReMuQ benchmark requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
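A minimal sketch of the general idea of an end-to-end multimodal query encoder: text and image features are fused into a single query vector and matched against a pre-encoded knowledge corpus. The fusion layer, feature dimensions, and dot-product matching are placeholders, not ReViz's actual architecture.

```python
# Hypothetical end-to-end multimodal query encoder; all shapes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalQueryEncoder(nn.Module):
    def __init__(self, txt_dim=300, img_dim=2048, out_dim=512):
        super().__init__()
        self.fuse = nn.Linear(txt_dim + img_dim, out_dim)  # stand-in fusion layer

    def forward(self, txt_feats, img_feats):
        q = self.fuse(torch.cat([txt_feats, img_feats], dim=-1))
        return F.normalize(q, dim=-1)

encoder = MultimodalQueryEncoder()
query = encoder(torch.randn(1, 300), torch.randn(1, 2048))
corpus = F.normalize(torch.randn(1000, 512), dim=-1)   # pre-encoded knowledge entries
top5 = (query @ corpus.T).topk(5).indices              # retrieve the top-5 entries
```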
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose MoRe, a novel Multi-modal Retrieval-based framework.
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image, respectively, from the knowledge corpus.
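Schematically, the two-branch retrieval described above could look like the sketch below, where a text retriever and an image retriever each return related knowledge that is then handed to the NER/RE model. The retriever objects and their `search` method are hypothetical placeholders.

```python
# Hypothetical two-branch retrieval: combine knowledge retrieved for the
# input sentence and for the paired image. Retriever APIs are placeholders.
def retrieve_context(text_retriever, image_retriever, sentence, image, k=3):
    text_hits = text_retriever.search(sentence, k=k)   # knowledge related to the text
    image_hits = image_retriever.search(image, k=k)    # knowledge related to the image
    return text_hits + image_hits                      # combined context for NER / RE
```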
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Probabilistic Compositional Embeddings for Multimodal Image Retrieval [48.450232527041436]
We investigate a more challenging scenario for composing multiple multimodal queries in image retrieval.
Given an arbitrary number of query images and/or texts, our goal is to retrieve target images containing the semantic concepts specified in multiple multimodal queries.
We propose a novel multimodal probabilistic composer (MPC) to learn an informative embedding that can flexibly encode the semantics of various queries.
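The composer's internals are not given in the summary; one common way to compose probabilistic embeddings, sketched below, is to model each query as a diagonal Gaussian and fuse queries with a precision-weighted product. This is an illustrative assumption, not necessarily MPC's mechanism.

```python
# Hypothetical composition of probabilistic (diagonal-Gaussian) query embeddings
# via a precision-weighted product; not necessarily the paper's composer.
import torch

def compose_gaussians(means, variances):
    # means, variances: lists of (dim,) tensors, one pair per query
    precisions = [1.0 / v for v in variances]
    total_precision = torch.stack(precisions).sum(0)
    fused_var = 1.0 / total_precision
    fused_mean = fused_var * torch.stack([p * m for p, m in zip(precisions, means)]).sum(0)
    return fused_mean, fused_var

mu, var = compose_gaussians([torch.randn(64), torch.randn(64)],
                            [torch.rand(64) + 0.1, torch.rand(64) + 0.1])
```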
arXiv Detail & Related papers (2022-04-12T14:45:37Z)
- Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
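As a rough illustration of entropy-based acquisition, the sketch below scores unlabeled VQA samples by the predictive entropy of a single-modality forward pass and selects the most uncertain ones for labeling. The model interface and the question-only pass are assumptions, not the paper's exact acquisition function.

```python
# Hypothetical entropy-based acquisition: label the samples whose
# single-modality predictions the model is least certain about.
import torch
import torch.nn.functional as F

def acquire(model, single_modality_inputs, budget=100):
    with torch.no_grad():
        logits = model(single_modality_inputs)               # e.g. a question-only pass
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return entropy.topk(budget).indices                      # indices to label next
```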
arXiv Detail & Related papers (2021-10-21T05:38:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.