Fugu-MT Paper Translation (Abstract): BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

Paper abstract: BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

arxiv url: http://arxiv.org/abs/2604.07201v1
Date: Wed, 08 Apr 2026 15:28:21 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.608769
Title: BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment
Title (reference translation): BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment
Authors: Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Abdelrahman Abdallah, Hyun-Soo Kang
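Abstract summary: The best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces.
Reference score (attention score computed independently by this site): 5.285385905661152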
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query -- raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10 -- exceeding the best text-only retriever (32.2) -- demonstrating that query alignment is the key bottleneck in multimodal-to-text retrieval. https://github.com/mm-bright/multimodal-reasoning-retrieval
Abstract (reference translation): The best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue that the bottleneck is not the retriever but the query -- raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10, exceeding the best text-only retriever (32.2) and demonstrating that query alignment is the key bottleneck in multimodal-to-text retrieval. https://github.com/mm-bright/multimodal-reasoning-retrieval
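
All results in the abstract are reported as nDCG@10. As a reading aid, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It assumes the linear-gain, log2-discount formulation; the abstract does not state which variant or evaluation tooling the authors used. The function names and toy data are hypothetical, and the reported figures (e.g. 27.6) are presumably the per-query mean scaled by 100.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k gains (linear gain, log2 discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query: DCG of the system ranking divided by the ideal DCG."""
    # Documents without a relevance judgment are treated as non-relevant (assumption).
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy data: two relevant documents, retrieved at ranks 1 and 3.
relevance = {"d1": 1.0, "d2": 1.0}
ranking = ["d1", "d9", "d2", "d7"]
score = ndcg_at_k(ranking, relevance)
```

Under this formulation, a ranking that places every relevant document first scores 1.0, and pushing a relevant document further down the list lowers the score logarithmically with its rank.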

Related papers