Fugu-MT Paper Translation (Abstract): UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

Paper overview: UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

arxiv url: http://arxiv.org/abs/2511.00405v1
Date: Sat, 01 Nov 2025 05:04:23 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:26.757085
Title: UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
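Abstract summary: The authors pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. UME-R1 is a universal multimodal embedding framework built on a two-stage training strategy. It is evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents.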
Reference score (attention score computed independently by this site): 70.60608084375691
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, with their combined oracle performance far exceeding that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at https://github.com/XMUDeepLIT/UME-R1.
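The abstract mentions two evaluation ideas without defining them: the combined "oracle" performance of discriminative and generative embeddings, and coverage under repeated sampling (pass@k). The sketch below shows one common way such quantities can be computed. It assumes a query counts as a hit when an embedding retrieves the correct target at rank 1; that criterion, the function names, and the toy data are illustrative assumptions, not the paper's evaluation code or results.

```python
from typing import Sequence


def pass_at_k(hits_per_query: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of queries for which at least one of the first k sampled
    generative embeddings is a hit (assumed: correct target at rank 1)."""
    solved = sum(any(hits[:k]) for hits in hits_per_query)
    return solved / len(hits_per_query)


def oracle_accuracy(disc_hits: Sequence[bool], gen_hits: Sequence[bool]) -> float:
    """Oracle combination: a query is solved if either the discriminative
    or the generative embedding is a hit."""
    solved = sum(d or g for d, g in zip(disc_hits, gen_hits))
    return solved / len(disc_hits)


# Toy, made-up outcomes for 4 queries with 3 samples each (not paper results).
sampled_hits = [
    [False, True, False],
    [True, True, True],
    [False, False, False],
    [False, False, True],
]
disc_hits = [True, False, False, False]
gen_hits = [hits[0] for hits in sampled_hits]  # use only the first sample

print(pass_at_k(sampled_hits, 1))            # 0.25
print(pass_at_k(sampled_hits, 3))            # 0.75
print(oracle_accuracy(disc_hits, gen_hits))  # 0.5
```

In this toy data, coverage rises from 0.25 to 0.75 as k goes from 1 to 3, and the oracle (0.5) beats either embedding type alone (0.25 each), which is the kind of complementarity the abstract describes.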

Related papers