Fugu-MT Paper Translation (Abstract): BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

Paper overview: BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

arxiv url: http://arxiv.org/abs/2510.26374v2
Date: Thu, 06 Nov 2025 09:27:20 GMT
Status: Translation complete
Last updated in system: 2025-11-07 13:46:06.46892
Title: BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning
Authors: Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li, Bolin Ding, Jingren Zhou
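Abstract summary: Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning. This paper introduces BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning.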
Reference score (independently computed attention score): 82.925106913459
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
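
The selection loop described in the abstract (posterior estimates of task difficulty, updated from rollouts and sampled via Thompson sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the Beta-Bernoulli model, the `target` success rate of 0.5, and every name below (`BetaTaskSelector`, `select`, `update`, `posterior_mean`) are assumptions made for illustration. The sketch covers explicit evidence only; the interpolation-based implicit evidence is omitted.

```python
import random


class BetaTaskSelector:
    """Hypothetical Thompson-sampling task selector (illustrative only)."""

    def __init__(self, num_tasks, target=0.5, seed=0):
        # Assumption: a Beta(1, 1) prior on each task's success probability.
        self.alpha = [1.0] * num_tasks
        self.beta = [1.0] * num_tasks
        # Assumption: tasks with success rate near `target` are most useful.
        self.target = target
        self.rng = random.Random(seed)

    def select(self, batch_size):
        # Thompson sampling: draw one success-rate sample per task
        # from its current posterior.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.alpha, self.beta)
        ]
        # Rank tasks by how close the sampled rate is to the target,
        # so tasks that look trivial or unsolvable are deprioritized.
        ranked = sorted(
            range(len(samples)),
            key=lambda i: abs(samples[i] - self.target),
        )
        return ranked[:batch_size]

    def update(self, task, successes, failures):
        # Explicit evidence: rollout outcomes observed for a selected task.
        self.alpha[task] += successes
        self.beta[task] += failures

    def posterior_mean(self, task):
        # Mean of the Beta posterior for this task's success probability.
        return self.alpha[task] / (self.alpha[task] + self.beta[task])


selector = BetaTaskSelector(num_tasks=5)
chosen = selector.select(batch_size=2)
selector.update(chosen[0], successes=3, failures=1)
```

Because the ranking uses posterior samples rather than posterior means, tasks with few observations still get picked occasionally, which is the exploration side of the trade-off the abstract mentions.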

Related papers