Fugu-MT Paper Translation (Abstract): PLUME: Latent Reasoning Based Universal Multimodal Embedding

Paper overview: PLUME: Latent Reasoning Based Universal Multimodal Embedding

arxiv url: http://arxiv.org/abs/2604.02073v1
Date: Thu, 02 Apr 2026 14:04:53 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.840908
Title: PLUME: Latent Reasoning Based Universal Multimodal Embedding
Authors: Chenwei He, Xiangzhao Hao, Tianyu Yang, Yuxiang Ma, Yuheng Jia, Lingxiang Wu, Chaoyang Zhao, Haiyun Guo, Jinqiao Wang
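Abstract summary: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings. PLUME is a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states.
Reference score (attention score computed independently by this site): 52.35354073629127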
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
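The core idea in the abstract, replacing generated CoT tokens with a short rollout of continuous latent states, can be illustrated with a toy sketch. The abstract does not describe the architecture, so everything below is an assumption made for illustration: `toy_step` is a hypothetical stand-in for one forward pass of the backbone, `W` is a random matrix rather than trained weights, and the semantic-anchor-guided transition adapter and the explicit-to-latent curriculum are omitted entirely. The only point shown is that the hidden state is fed back as the next input for a fixed number of steps, with no token decoded, and the final state is used as the embedding.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def toy_step(W, h):
    """Hypothetical stand-in for one backbone forward pass: h -> tanh(W h)."""
    return [math.tanh(z) for z in matvec(W, h)]


def l2_normalize(v):
    """Scale v to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def latent_rollout_embed(W, h0, num_steps=8):
    """Feed the hidden state back as the next input for num_steps steps,
    never decoding a token, then return the normalized final state."""
    h = h0
    for _ in range(num_steps):
        h = toy_step(W, h)
    return l2_normalize(h)


def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))


# Toy data: random weights and random initial states (assumptions, not the paper's setup).
rng = random.Random(0)
dim = 16
W = [[rng.gauss(0, 1 / math.sqrt(dim)) for _ in range(dim)] for _ in range(dim)]
query_state = [rng.gauss(0, 1) for _ in range(dim)]
candidate_state = [rng.gauss(0, 1) for _ in range(dim)]

q = latent_rollout_embed(W, query_state, num_steps=8)
c = latent_rollout_embed(W, candidate_state, num_steps=8)
score = cosine(q, c)  # retrieval score between query and candidate
```

In this sketch the computation budget is fixed by `num_steps`, mirroring the abstract's "fewer than 10 latent steps"; an explicit-CoT baseline would instead decode and re-embed a token at every step, typically for hundreds of steps.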

Related papers