ReMatch: Boosting Representation through Matching for Multimodal Retrieval
- URL: http://arxiv.org/abs/2511.19278v2
- Date: Wed, 26 Nov 2025 03:26:54 GMT
- Title: ReMatch: Boosting Representation through Matching for Multimodal Retrieval
- Authors: Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson,
- Abstract summary: ReMatch is a framework that leverages the generative strength of MLLMs for multimodal retrieval.<n>We train the embedding MLLM end-to-end with a chat-style generative matching stage.<n>Our experiments show particularly strong zero-shot generalization results on five datasets.
- Score: 29.610030065465793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline,we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.
Related papers
- CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension [49.6969505536365]
We propose CREM, with a unified framework that enhances multimodal representations for retrieval while preserving generative ability.<n>CREM achieves state-of-the-art retrieval performance on MMEB while maintaining strong generative performance on multiple comprehension benchmarks.
arXiv Detail & Related papers (2026-02-22T08:09:51Z) - MuCo: Multi-turn Contrastive Learning for Multimodal Embedding Model [57.89395815934156]
Multi-Turn Contrastive Learning (MuCo) is a dialogue-inspired framework that revisits this process.<n>Experiments exhibit MuCo with a newly curated 5M multimodal multi-turn dataset (M3T)
arXiv Detail & Related papers (2026-02-06T05:18:33Z) - Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval [25.629529312687694]
We propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of Multimodal Large Language Models (MLLMs)<n>Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded.<n>Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline.
arXiv Detail & Related papers (2025-11-20T08:44:47Z) - MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval [34.21875369884307]
Multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs.<n>While flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learn modality shortcut.<n>We propose a modality composition awareness framework to mitigate this issue.
arXiv Detail & Related papers (2025-10-17T11:20:35Z) - UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning [101.62386137855704]
We present a novel Universal Multimodal Embedding (UniME-V2) model.<n>Our approach first constructs a potential hard negative set through global retrieval.<n>We then introduce the MLLM-as-a-Judge mechanism, which utilizes MLLMs to assess the semantic alignment of query-candidate pairs.<n>These scores serve as a foundation for hard negative mining, mitigating the impact of false negatives and enabling the identification of diverse, high-quality hard negatives.
arXiv Detail & Related papers (2025-10-15T13:07:00Z) - Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs [28.752042722391934]
Sequential recommendation (SR) aims to capture users' dynamic interests and sequential patterns based on their historical interactions.<n>MME-SID integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse.<n>Extensive experiments on three public datasets validate the superior performance of MME-SID.
arXiv Detail & Related papers (2025-09-02T07:02:29Z) - U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs [24.551034147718312]
Universal multimodal retrieval (UMR) aims to address complex retrieval tasks where both queries and candidates span diverse modalities.<n>We present a study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs.<n>We introduce a unified framework termed U-MARVEL, which outperforms state-of-the-art competitors on the M-B benchmark.
arXiv Detail & Related papers (2025-07-20T10:27:34Z) - Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations.<n>We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z) - Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation [67.84581846180458]
Session-based recommendation (SBR) predicts the next item based on anonymous sessions.<n>Recent Multimodal SBR methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness.<n>We propose a multimodal LLM-enhanced framework TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR.
arXiv Detail & Related papers (2025-04-13T07:49:08Z) - LLM-based Bi-level Multi-interest Learning Framework for Sequential Recommendation [54.396000434574454]
We propose a novel multi-interest SR framework combining implicit behavioral and explicit semantic perspectives.<n>It includes two modules: the Implicit Behavioral Interest Module and the Explicit Semantic Interest Module.<n>Experiments on four real-world datasets validate the framework's effectiveness and practicality.
arXiv Detail & Related papers (2024-11-14T13:00:23Z) - MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs)<n>We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks.<n>Our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR.
arXiv Detail & Related papers (2024-11-04T20:06:34Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.