Fugu-MT Paper Translation (Summary): LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

Paper summary: LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

arxiv url: http://arxiv.org/abs/2605.11301v1
Date: Mon, 11 May 2026 22:42:12 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.46107
Title: LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?
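Title (reference translation): LatentRouter: Can We Choose the Right Multimodal Model Before Seeing the Answer?
Authors: Xueqi Cheng, Yushun Dong
Abstract summary: Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. This paper proposes LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. In experiments on MMR-Bench and VL-RouterBench, LatentRouter outperformed fixed-model, feature-level, and learned-router baselines.
Reference score (independently computed attention score): 69.71754384259167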
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
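Abstract (reference translation): Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. A router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. This paper proposes LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, and a bounded capsule correction refines close decisions without letting residual signals dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. In experiments on MMR-Bench and VL-RouterBench, LatentRouter outperformed fixed-model, feature-level, and learned-router baselines. Further analysis shows that the gains are largest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at https://github.com/LabRAI/LatentRouter.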
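
The abstract describes a utility-based policy that trades predicted quality against cost, applies a bounded correction, and masks unavailable models. The sketch below illustrates that selection step only. It is not the authors' implementation: the function name `route`, the linear cost penalty `trade_off * cost`, the `tanh` bounding, and all numbers are assumptions made for illustration, and the per-model quality scores are taken as given rather than predicted by a learned router.

```python
import math
from typing import Optional, Sequence


def route(
    quality: Sequence[float],
    cost: Sequence[float],
    available: Sequence[bool],
    trade_off: float = 0.0,
    residual: Optional[Sequence[float]] = None,
    bound: float = 0.05,
) -> int:
    """Return the index of the available model with the highest utility.

    Hypothetical utility (an assumption, not taken from the paper):
        utility_i = quality_i + bound * tanh(residual_i) - trade_off * cost_i
    The tanh term keeps the correction inside [-bound, +bound], so it can
    flip close decisions but cannot override a large quality gap.
    """
    if not any(available):
        raise ValueError("no candidate model is available")

    best_index, best_utility = -1, -math.inf
    for i, (q, c, ok) in enumerate(zip(quality, cost, available)):
        if not ok:
            continue  # availability masking: skip models outside the pool
        correction = bound * math.tanh(residual[i]) if residual is not None else 0.0
        utility = q + correction - trade_off * c
        if utility > best_utility:
            best_index, best_utility = i, utility
    return best_index


# Illustrative values only (not results from the paper).
quality = [0.80, 0.78, 0.60]
cost = [1.0, 0.2, 0.1]
everyone = [True, True, True]

performance_choice = route(quality, cost, everyone)                 # quality only
cost_aware_choice = route(quality, cost, everyone, trade_off=0.1)   # quality minus cost
masked_choice = route(quality, cost, [True, False, True], trade_off=0.1)
```

With these numbers, the performance-oriented call picks model 0, the cost-aware call picks the cheaper model 1, and masking model 1 sends the choice back to model 0. Scoring each model independently is what lets the candidate pool change without retraining a fixed-size output layer, which is the property the abstract attributes to shared per-model scoring.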
