Recurrence Meets Transformers for Universal Multimodal Retrieval
- URL: http://arxiv.org/abs/2509.08897v1
- Date: Wed, 10 Sep 2025 18:00:29 GMT
- Title: Recurrence Meets Transformers for Universal Multimodal Retrieval
- Authors: Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara,
- Abstract summary: ReT-2 is a unified retrieval model that supports multimodal queries composed of both images and text.<n>We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations.<n>When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets.
- Score: 59.92546492752452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
Related papers
- Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents [99.62178668680578]
We propose Vision-Centric Contrastive Learning (VC2L), a unified framework that models text, images, and their combinations using a single vision transformer.<n> VC2L operates entirely in pixel space by rendering all inputs, whether textual, visual, or combined, as images.<n>To capture complex cross-modal relationships in web documents, VC2L employs a snippet-level contrastive learning objective that aligns consecutive multimodal segments.
arXiv Detail & Related papers (2025-10-21T14:59:29Z) - Visualized Text-to-Image Retrieval [55.178938325324864]
We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval.<n>VisRet first projects textual queries into the image modality via T2I generation.<n>It then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features.
arXiv Detail & Related papers (2025-05-26T17:59:33Z) - A Multi-Granularity Retrieval Framework for Visually-Rich Documents [4.804551482123172]
We propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR.<n>Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering.<n>Our framework demonstrates robust performance without the need for task-specific fine-tuning.
arXiv Detail & Related papers (2025-05-01T02:40:30Z) - IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval [29.05476868272228]
Instance-Driven Multimodal Image Retrieval (IDMR) is a novel task that requires models to retrieve images containing the same instance as a query image while matching a text-described scenario.<n>To benchmark this capability, we develop IDMR-bench using real-world object tracking and first-person video data.<n>Our Multimodal Large Language Model (MLLM) based retrieval model, trained on 1.2M samples, outperforms state-of-the-art approaches on both traditional benchmarks and our zero-shot IDMR-bench.
arXiv Detail & Related papers (2025-04-01T16:47:20Z) - Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval [44.008094698200026]
Cross-modal retrieval is gaining increasing efficacy and interest from the research community.<n>In this paper, we design an approach that allows for multimodal queries composed of both an image and a text.<n>Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones.
arXiv Detail & Related papers (2025-03-03T19:01:17Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models.<n>The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking.<n>To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT)
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.<n>We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.<n>We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs)<n>We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.<n>Our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR.
arXiv Detail & Related papers (2024-11-04T20:06:34Z) - End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model ReViz'' that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.