Fugu-MT Paper Translation (Abstract): Think Then Embed: Generative Context Improves Multimodal Embedding

Paper summary: Think Then Embed: Generative Context Improves Multimodal Embedding

arxiv url: http://arxiv.org/abs/2510.05014v1
Date: Mon, 06 Oct 2025 16:53:56 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.996414
Title: Think Then Embed: Generative Context Improves Multimodal Embedding
Title (reference translation): Think Then Embed: Generative Context Improves Multimodal Embedding
Authors: Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Qi Guo, Ser-Nam Lim, Aashu Singh, Xiangjun Fan
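Abstract summary: This paper proposes the Think-Then-Embed (TTE) framework for Universal Multimodal Embeddings (UME). By leveraging a powerful MLLM reasoner, it achieves state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets.
Reference score (independently computed attention score): 47.493748186420966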
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There is a growing interest in Universal Multimodal Embeddings (UME), where models are required to generate task-specific representations. While recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, they treat MLLMs solely as encoders, overlooking their generative capacity. However, such an encoding paradigm becomes less effective as instructions become more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries, followed by an embedder that produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner using high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model for improved efficiency without sacrificing performance.
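Abstract (reference translation): Interest is growing in Universal Multimodal Embeddings (UME), which require models to generate task-specific representations. Recent studies show that Multimodal Large Language Models (MLLMs) perform well on such tasks, but they treat MLLMs solely as encoders and overlook their generative capacity. This encoding paradigm becomes less effective as instructions grow more complex and require compositional reasoning. Inspired by the proven effectiveness of chain-of-thought reasoning, we propose a general Think-Then-Embed (TTE) framework for UME, consisting of a reasoner and an embedder. The reasoner MLLM first generates reasoning traces that explain complex queries; the embedder then produces representations conditioned on both the original query and the intermediate reasoning. This explicit reasoning step enables a more nuanced understanding of complex multimodal instructions. Our contributions are threefold. First, by leveraging a powerful MLLM reasoner, we achieve state-of-the-art performance on the MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune a smaller MLLM reasoner on high-quality embedding-centric reasoning traces, achieving the best performance among open-source models with a 7% absolute gain over recently proposed models. Third, we investigate strategies for integrating the reasoner and embedder into a unified model that improves efficiency without sacrificing performance.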
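
The two-stage flow described in the abstract (a reasoner produces a reasoning trace, then an embedder encodes the original query together with that trace) can be sketched as below. This is only an illustrative toy under stated assumptions, not the authors' implementation: `toy_reasoner`, `toy_embedder`, `think_then_embed`, and `cosine` are hypothetical names, the reasoner is a rule-based text function rather than an MLLM, and the embedder is a hashed bag-of-words vector rather than a learned multimodal model.

```python
import hashlib
import math
from typing import Callable, List


def toy_reasoner(query: str) -> str:
    """Hypothetical stand-in for the reasoner MLLM: returns a short text trace."""
    terms = ", ".join(sorted(set(query.lower().split())))
    return f"The query asks about: {query.lower()}. Key terms: {terms}"


def toy_embedder(text: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in for the embedder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Map each token to a fixed bucket with a deterministic hash.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def think_then_embed(
    query: str,
    reasoner: Callable[[str], str] = toy_reasoner,
    embedder: Callable[[str], List[float]] = toy_embedder,
) -> List[float]:
    """Think, then embed: condition the embedding on the query plus its trace."""
    trace = reasoner(query)               # step 1: generate the reasoning trace
    conditioned = f"{query}\n{trace}"     # step 2: combine query and trace
    return embedder(conditioned)          # step 3: embed the combined input


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity for vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    query_vec = think_then_embed("Find the image showing a red car at night")
    candidate_vec = toy_embedder("a red car parked on a street at night")
    print(f"similarity: {cosine(query_vec, candidate_vec):.3f}")
```

Passing the reasoner and embedder as arguments keeps the two roles separable in the sketch, so either stand-in could be swapped for a real model without changing the calling code.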

Related papers