Fugu-MT Paper Translation (Abstract): Expected Reward Prediction, with Applications to Model Routing

Paper summary: Expected Reward Prediction, with Applications to Model Routing

arxiv url: http://arxiv.org/abs/2603.20217v1
Date: Tue, 03 Mar 2026 10:10:44 GMT
Status: Translation complete
System update date: 2026-04-06 02:36:12.898945
Title: Expected Reward Prediction, with Applications to Model Routing
Title (reference translation): Expected Reward Prediction and Its Application to Model Routing
Authors: Kenan Hasanaliyev, Silas Alberti, Jenny Hamer, Dheeraj Rajagopal, Kevin Robinson, Jasper Snoek, Victor Veitch, Alexander Nicholas D'Amour
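Abstract summary: It is straightforward to predict the expected reward that an LLM would earn from a reward model under repeated sampling. These expected reward predictions are also shown to be precise and discriminative enough to support an application to a model routing protocol.
Reference score (independently computed attention score): 51.74583237294919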
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reward models are a standard tool to score responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of n sampled responses. In this paper, we study whether scores from response-level reward models can be lifted to score a model's suitability for a prompt, prior to seeing responses from that model. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support an application to a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of this routing procedure on the open-perfectblend dataset, using a model pool composed of Llama3.1-Instruct 8B/70B, Gemma2-IT 9B/27B, and Gemma1-IT 7B models. Our simple expected reward prediction-based routing (ERP) outperforms baselines that route prompts to models with the best average performance within each prompt's category, and explains the success of more complex routing protocols that implicitly estimate an expected reward. Our approach has the added advantage of being trivially extensible as new models are added to the pool.
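Abstract (reference translation): Reward models are a standard tool for scoring responses from LLMs. Reward models are built to rank responses to a fixed prompt sampled from a single model. In this paper, we examine whether scores from response-level reward models can indicate how well a model suits a prompt before any responses from that model are seen. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support an application to a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of the routing procedure on the open-perfectblend dataset, using a model pool composed of Llama3.1-Instruct 8B/70B, Gemma2-IT 9B/27B, and Gemma1-IT 7B models. Our simple expected reward prediction-based routing (ERP) outperforms baselines that route prompts to the model with the best average performance within each prompt's category, and explains the success of more complex routing protocols that implicitly estimate an expected reward. Our approach has the added advantage of being trivially extensible as new models are added to the pool.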
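
The routing idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact routing rule, so everything here is an assumption made for illustration: the `Candidate` type, the `empirical_expected_reward` and `route` helpers, the linear cost penalty controlled by `cost_weight`, and all numeric values (which are invented, not results from the paper). The sketch takes a per-model predicted expected reward for one prompt as given and picks the model with the best reward-versus-cost trade-off.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """One model in the pool, scored for a single prompt (hypothetical type)."""
    name: str
    predicted_reward: float  # assumed output of an expected-reward predictor
    cost: float              # assumed relative inference cost (here: size in B params)


def empirical_expected_reward(sample_rewards: Sequence[float]) -> float:
    """Monte Carlo estimate of expected reward: the mean reward-model score
    over repeated samples from one model for one prompt. A predictor could be
    trained to regress onto this quantity (assumption, not the paper's recipe)."""
    if not sample_rewards:
        raise ValueError("sample_rewards must not be empty")
    return fmean(sample_rewards)


def route(candidates: Sequence[Candidate], cost_weight: float = 0.0) -> Candidate:
    """Pick the candidate maximizing predicted reward minus a linear cost
    penalty. The linear penalty is an illustrative assumption."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda c: c.predicted_reward - cost_weight * c.cost)


# Illustrative, made-up predictions for a single prompt.
pool = [
    Candidate("Llama3.1-Instruct 8B", predicted_reward=0.62, cost=8.0),
    Candidate("Llama3.1-Instruct 70B", predicted_reward=0.71, cost=70.0),
    Candidate("Gemma2-IT 9B", predicted_reward=0.66, cost=9.0),
]

if __name__ == "__main__":
    print(route(pool).name)                    # no cost penalty
    print(route(pool, cost_weight=0.01).name)  # cost-aware routing
```

Under this formulation, adding a new model to the pool only requires a reward prediction for that model and one more `Candidate` entry, which is one way to read the abstract's claim that the approach is trivially extensible.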


Related papers