Fugu-MT Paper Translation (Abstract): DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

Paper abstract: DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

arxiv url: http://arxiv.org/abs/2605.17295v1
Date: Sun, 17 May 2026 07:14:44 GMT
Status: translation complete
System update date: 2026-05-19 17:57:47.837514
Title: DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
Authors: Shaobo Wang, Yujie Chen, Yafeng Sun, Wenjie Qiu, Zhihui Xie, Sihang Li, Yucheng Li, Huiqiang Jiang, Xingzhang Ren, Xuming Hu, Dayiheng Liu, Linfeng Zhang
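Abstract summary: This paper introduces DISA, which moves the partition-function calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting estimate. On two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL.
Reference score (attention score computed independently by this site): 56.9445657766829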
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern reasoning agents are increasingly evaluated on their ability to generate multiple valid solution paths, plans, or tool-use traces for a given input. Standard reward-maximizing RL tends to collapse onto the most easily reinforced high-reward mode, whereas distribution-matching RL aims to allocate probability mass across the entire reward-shaped solution set. Achieving this objective requires computing a prompt-dependent partition function over the trajectory space. Because existing distribution-matching methods learn this partition function online alongside the policy, calibration errors in the partition function directly distort policy updates and remain impossible to diagnose independently. We introduce DISA, short for Decoupled Importance-Sampled Anchoring, which moves this calibration problem outside the RL loop. DISA draws proposal trajectories offline, estimates the partition function via importance sampling, and freezes the resulting partition-function estimate before policy optimization begins. This decoupling preserves the distribution-matching objective while strictly separating partition-function estimation from policy learning in data, gradients, loss, and diagnostics. Empirically, on two open-weight backbones across six math and three code benchmarks, DISA matches or exceeds the online-coupled distribution-matching baseline FlowRL, outperforms reward-maximization baselines GRPO and GSPO on math averages, and exceeds LoRA-SFT distillation by up to 13.8 Mean@8 points on the same offline trajectories. An LLM-as-judge evaluation further shows that DISA retains substantially more strategy-level diversity than reward-maximization baselines, and sensitivity studies on the proposal strength and inverse temperature follow the bias-variance pattern predicted by the analysis.
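
The offline step described in the abstract (draw proposal trajectories, estimate the partition function by importance sampling, then freeze it) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a target distribution proportional to pi_ref(y|x) * exp(beta * r(x, y)), which the abstract does not state, so that Z(x) is the expectation under a proposal q of exp(beta * r) * pi_ref / q. The function name `log_partition_is`, its arguments, and the prompt identifier are all hypothetical.

```python
import math


def log_partition_is(rewards, logp_ref, logp_prop, beta):
    """Importance-sampled estimate of log Z(x) from N offline samples y_i ~ q(.|x).

    Assumption (not from the paper): Z(x) = sum_y pi_ref(y|x) * exp(beta * r(x, y)).
    rewards[i]   : scalar reward r(x, y_i)
    logp_ref[i]  : log pi_ref(y_i | x)
    logp_prop[i] : log q(y_i | x) under the proposal that generated y_i
    """
    if not rewards or not (len(rewards) == len(logp_ref) == len(logp_prop)):
        raise ValueError("inputs must be non-empty and of equal length")
    # Log of each importance-weighted term: beta*r_i + log pi_ref - log q.
    terms = [beta * r + lr - lq for r, lr, lq in zip(rewards, logp_ref, logp_prop)]
    # Log-mean-exp, shifted by the maximum for numerical stability.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms)) - math.log(len(terms))


# Estimate once per prompt before training, then treat the values as constants.
frozen_log_z = {
    "prompt-0": log_partition_is(
        rewards=[1.0, 0.0],
        logp_ref=[-2.0, -3.0],
        logp_prop=[-2.0, -3.0],  # proposal equal to the reference in this toy case
        beta=1.0,
    )
}
```

Because `frozen_log_z` is computed before policy optimization and never updated, any error in it can be inspected separately from the policy, which is the decoupling the abstract emphasizes. With few samples or a proposal that poorly covers high-reward trajectories, an estimate of this kind becomes high-variance.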
