UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
- URL: http://arxiv.org/abs/2511.00405v1
- Date: Sat, 01 Nov 2025 05:04:23 GMT
- Title: UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
- Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
- Abstract summary: We pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy. UME-R1 is evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents.
- Score: 70.60608084375691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at https://github.com/XMUDeepLIT/UME-R1.
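To make the generative-embedding idea concrete, here is a minimal, hedged Python sketch: a language model first produces a short reasoning trace, and the hidden state after that trace is pooled into a "generative" embedding, contrasted with direct last-token pooling as a "discriminative" embedding; repeated sampling of the rationale then illustrates the pass@k effect described above. The model name, prompt wording, pooling choice, and text-only setup are illustrative assumptions kept small for brevity, not UME-R1's actual architecture, prompts, or training pipeline.

```python
# Illustrative sketch only: a small text-only LM stands in for the MLLM used by UME-R1.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder backbone, not the paper's model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def discriminative_embedding(text: str) -> torch.Tensor:
    """Conventional route: encode the input directly and pool the final hidden state."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

def generative_embedding(text: str, instruction: str) -> torch.Tensor:
    """Generative route (sketch): sample a reasoning trace, then pool the state after it."""
    prompt = f"{instruction}\nInput: {text}\nReason step by step, then summarize the key content:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        gen_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
        out = model(gen_ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # embedding taken after reasoning has unfolded

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Repeated sampling at inference: each sampled rationale yields a slightly different
# embedding; the query counts as covered if ANY of the k samples ranks the target first.
query = "a dog catching a frisbee"
docs = ["dog playing fetch outdoors", "a cat sleeping on a sofa"]
doc_embs = [discriminative_embedding(d) for d in docs]
k, hits = 4, 0
for _ in range(k):
    q = generative_embedding(query, "Represent this query for retrieval.")
    best = max(range(len(docs)), key=lambda i: cosine(q, doc_embs[i]))
    hits += int(best == 0)  # document 0 is the intended match
print(f"pass@{k} = {int(hits > 0)} (correct in {hits}/{k} samples)")
```

Because the abstract reports that discriminative and generative embeddings are complementary, a practical system might keep both and fuse their similarity scores; any particular fusion rule (e.g., score averaging) would be a further assumption beyond what the abstract states.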
Related papers
- Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an <EOS>-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings (see the toy sketch at the end of this list). Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z) - Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings [44.77164359074224]
Multimodal Large Language Models (MLLMs) have become pivotal for advancing Universal Multimodal Embeddings (UME). Recent studies demonstrate that incorporating generative Chain-of-Thought (CoT) reasoning can substantially enhance task-specific representations. We propose a reasoning-driven UME framework that integrates Embedder-Guided Reinforcement Learning (EG-RL) to optimize the Reasoner to produce evidential Traceability CoT.
arXiv Detail & Related papers (2026-02-14T15:35:03Z) - Analyzing Diffusion and Autoregressive Vision Language Models in Multimodal Embedding Space [52.34072027212278]
Embedding models are a fundamental component of modern AI systems such as semantic search and retrieval-augmented generation. Recent advances in large foundation models have substantially accelerated the development of embedding models. We present the first systematic study of converting multimodal dLLMs into embedding models.
arXiv Detail & Related papers (2026-01-19T06:51:15Z) - MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation [31.90681057778075]
Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge. Existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation.
arXiv Detail & Related papers (2025-12-19T03:19:54Z) - STAR: STacked AutoRegressive Scheme for Unified Multimodal Learning [37.68078190711403]
We introduce STAR: a STacked AutoRegressive scheme for task-progressive unified multimodal learning. This approach decomposes multimodal learning into multiple stages: understanding, generation, and editing. Experiments demonstrate that STAR achieves state-of-the-art performance on GenEval (0.91), DPG-Bench (87.44), and ImgEdit (4.34).
arXiv Detail & Related papers (2025-12-15T07:02:59Z) - Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval [25.629529312687694]
We propose Reasoning Guided Embeddings (RGE), which preserves the generative rationale process of Multimodal Large Language Models (MLLMs). Our method first enables the model to perform structured rationale generation conditioned on the instruction, and then extracts representations after reasoning has unfolded. Experiments on the MMEB benchmark show that reasoning-guided conditioning improves multimodal retrieval performance by 4.9% over the non-reasoning baseline.
arXiv Detail & Related papers (2025-11-20T08:44:47Z) - NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z) - From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model [29.879983760203256]
Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, but adapting their generative nature for discriminative representation learning remains a significant challenge. We propose an efficient framework for universal multimodal embeddings, which bridges the gap by centering on two synergistic components.
arXiv Detail & Related papers (2025-08-01T07:31:24Z) - MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards [55.99492656542475]
We propose SUDER (Self-improving Unified LMMs with Dual Self-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z) - CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning [24.981279071712173]
We introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. Our approach unifies these traditionally separate tasks, achieving state-of-the-art results on both multimodal retrieval and multimodal generation benchmarks.
arXiv Detail & Related papers (2025-03-25T17:57:17Z) - Multi-modal Generative AI: Multi-modal LLMs, Diffusions and the Unification [41.88402339122694]
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. This paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification of understanding and generation.
arXiv Detail & Related papers (2024-09-23T13:16:09Z) - Generative-based Fusion Mechanism for Multi-Modal Tracking [35.77340348091937]
We introduce Conditional Generative Adversarial Networks (CGANs) and Diffusion Models (DMs) as generative fusion mechanisms for multi-modal tracking.
We condition these multi-modal features with random noise in the generative-model framework, effectively transforming the original training samples into harder instances.
This design excels at extracting discriminative clues from the features, enhancing the ultimate tracking performance.
arXiv Detail & Related papers (2023-09-04T17:22:10Z)
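For the <EOS>-based reconstruction task mentioned in the CoCoA entry near the top of this list, the toy PyTorch sketch below shows one way such an auxiliary objective could be wired up: a small head decodes a pooled <EOS> embedding back into the input tokens, and the resulting cross-entropy term would be added to the usual contrastive loss. All dimensions, the head architecture, and the random stand-in tensors are assumptions for illustration; the actual CoCoA design on Qwen2-VL is not reproduced here.

```python
# Toy illustration of an EOS-embedding reconstruction objective; shapes are made up.
import torch
import torch.nn as nn

hidden_dim, vocab_size, seq_len, batch = 256, 1000, 16, 4  # hypothetical sizes

class EOSReconstructionHead(nn.Module):
    """Decodes a single <EOS> embedding into logits over the original input tokens."""
    def __init__(self, d: int, vocab: int, length: int):
        super().__init__()
        self.length, self.d = length, d
        self.expand = nn.Linear(d, d * length)   # spread the pooled vector over positions
        self.to_vocab = nn.Linear(d, vocab)      # per-position vocabulary logits

    def forward(self, eos_emb: torch.Tensor) -> torch.Tensor:
        h = self.expand(eos_emb).view(-1, self.length, self.d)
        return self.to_vocab(h)                  # (batch, length, vocab)

head = EOSReconstructionHead(hidden_dim, vocab_size, seq_len)
eos_embeddings = torch.randn(batch, hidden_dim)              # stand-in for MLLM <EOS> states
input_ids = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for the original input

logits = head(eos_embeddings)
recon_loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), input_ids.reshape(-1))
# In CoCoA-style training this term would supplement the contrastive objective, pushing the
# <EOS> embedding to retain enough content to reconstruct its own input.
print(f"reconstruction loss: {recon_loss.item():.3f}")
```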