Related papers: MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

URL: http://arxiv.org/abs/2411.02571v1
Date: Mon, 04 Nov 2024 20:06:34 GMT
Title: MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
Authors: Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping,
Abstract summary: This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs) We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. We propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers.
Score: 78.5013630951288
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-of-the-art retrieval models typically address a straightforward search scenario, where retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and retrieved results. This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs), enabling a broader search scenario, termed universal multimodal retrieval, where multiple modalities and diverse retrieval tasks are accommodated. To this end, we first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but underperforms a smaller CLIP retriever in cross-modal retrieval tasks due to modality bias from MLLMs. To address the issue, we propose modality-aware hard negative mining to mitigate the modality bias exhibited by MLLM retrievers. Second, we propose to continually fine-tune the universal multimodal retriever to enhance its text retrieval capability while maintaining multimodal retrieval capability. As a result, our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR, which spans multiple domains and tasks, while also surpassing the state-of-the-art text retrieval model, NV-Embed-v1, on MTEB retrieval benchmark. Finally, we explore to prompt the off-the-shelf MLLMs as the zero-shot rerankers to refine the ranking of the candidates from the multimodal retriever. We find that through prompt-and-reranking, MLLMs can further improve multimodal retrieval when the user queries (e.g., text-image composed queries) are more complex and challenging to understand. These findings also pave the way to advance universal multimodal retrieval in the future.

Related papers

Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval [27.493644447594367]
MCMR (Multi-Conditional Multimodal Retrieval) is a large-scale benchmark designed to evaluate fine-grained, multi-condition cross-modal retrieval under natural-language queries.<n>It spans five product domains: upper and bottom clothing, jewelry, shoes, and furniture.<n>We benchmark a diverse suite of MLLM-based multimodal retrievers and vision-language rerankers to assess their condition-aware reasoning abilities.
arXiv Detail & Related papers (2026-03-01T12:53:47Z)
VIRTUE: Versatile Video Retrieval Through Unified Embeddings [6.517174336539377]
We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus and moment-level retrieval capabilities.<n>We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search.<n>Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks.
arXiv Detail & Related papers (2026-01-17T23:13:38Z)
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search [61.77858432092777]
We present DeepMMSearch-R1, the first multimodal large language model capable of performing on-demand, multi-turn web searches.<n>DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective.<n>We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach.
arXiv Detail & Related papers (2025-10-14T17:59:58Z)
Recurrence Meets Transformers for Universal Multimodal Retrieval [59.92546492752452]
ReT-2 is a unified retrieval model that supports multimodal queries composed of both images and text.<n>We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations.<n>When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets.
arXiv Detail & Related papers (2025-09-10T18:00:29Z)
MMSearch-R1: Incentivizing LMMs to Search [49.889749277236376]
We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables on-demand, multi-turn search in real-world Internet environments.<n>Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty.
arXiv Detail & Related papers (2025-06-25T17:59:42Z)
Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers [74.17516978246152]
Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques.<n>We propose EXSEARCH, an agentic search framework, where the LLM learns to retrieve useful information as the reasoning unfolds.<n>Experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines.
arXiv Detail & Related papers (2025-05-26T15:27:55Z)
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval [29.05476868272228]
Instance-Driven Multimodal Image Retrieval (IDMR) is a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario. To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data. Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench.
arXiv Detail & Related papers (2025-04-01T16:47:20Z)
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.30364248231053]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M2RAG) M2RAG is a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs) To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT)
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant [63.28378110792787]
We introduce LamRA, a versatile framework designed to empower Large Multimodal Models with sophisticated retrieval and reranking capabilities. For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning. For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance.
arXiv Detail & Related papers (2024-12-02T17:10:16Z)
Enhancing Long Context Performance in LLMs Through Inner Loop Query Mechanism [2.919891871101241]
Transformers have a quadratic scaling of computational complexity with input size. Retrieval-augmented generation (RAG) can better handle longer contexts by using a retrieval system. We introduce a novel approach, Inner Loop Memory Augmented Tree Retrieval (ILM-TR)
arXiv Detail & Related papers (2024-10-11T19:49:05Z)
UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entities Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale. We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization [9.739781953744606]
Open-Domain Multi-Document Summarization (ODMDS) is the task of generating summaries from large document collections in response to user queries. Traditional retrieve-then-summarize approaches fall short for open-ended queries in ODMDS tasks. We propose LightPAL, a lightweight passage retrieval method for ODMDS.
arXiv Detail & Related papers (2024-06-18T10:57:27Z)
Needle In A Multimodal Haystack [79.81804334634408]
We present the first benchmark specifically designed to evaluate the capability of existing MLLMs to comprehend long multimodal documents. Our benchmark includes three types of evaluation tasks: multimodal retrieval, counting, and reasoning. We observe that existing models still have significant room for improvement on these tasks, especially on vision-centric evaluation.
arXiv Detail & Related papers (2024-06-11T13:09:16Z)
UniIR: Training and Benchmarking Universal Multimodal Information Retrievers [76.06249845401975]
We introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks. We construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
arXiv Detail & Related papers (2023-11-28T18:55:52Z)
End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries. We introduce a retriever model ReViz'' that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion. We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows RMs to expand knowledge in queries using LLM-generated knowledge collections. InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z)
Large Language Models are Strong Zero-Shot Retriever [89.16756291653371]
We propose a simple method that applies a large language model (LLM) to large-scale retrieval in zero-shot scenarios. Our method, the Language language model as Retriever (LameR), is built upon no other neural models but an LLM.
arXiv Detail & Related papers (2023-04-27T14:45:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.