Fugu-MT Paper Translation (Abstract): Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling

Paper overview: Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling

arxiv url: http://arxiv.org/abs/2502.05434v1
Date: Sat, 08 Feb 2025 03:47:00 GMT
Status: Translation complete
Last updated in system: 2025-02-11 18:57:49.646952
Title: Sample-Efficient Reinforcement Learning from Human Feedback via Information-Directed Sampling
Authors: Han Qi, Haochen Yang, Qiaosheng Zhang, Zhuoran Yang
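Abstract summary: This work studies reinforcement learning from human feedback (RLHF), a critical problem in training large language models. Its main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS). The work showcases the value of information theory in reinforcement learning and in the training of large language models.
Reference score (attention score computed by this site): 46.035795210898414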
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the problem of reinforcement learning from human feedback (RLHF), a critical problem in training large language models, from a theoretical perspective. Our main contribution is the design of novel sample-efficient RLHF algorithms based on information-directed sampling (IDS), an online decision-making principle inspired by information theory. Our algorithms maximize the sum of the value function and a mutual information term, which quantifies the information gained about the unknown environment through observed human feedback data and thereby encourages exploration. To tackle the challenge of large state spaces and improve sample efficiency, we construct a simplified \emph{surrogate environment} and introduce a novel distance measure (named the \emph{$\ell_g$-distance}), enabling our IDS-based algorithm to achieve a Bayesian regret upper bound of order $O(H^{\frac{3}{2}}\sqrt{\log(K(\epsilon)) T})$, where $H$ is the episode length, $T$ is the number of episodes, and $K(\epsilon)$ is related to the covering number of the environment. Specializing to the tabular setting, this regret bound is of order $\tilde{O}(H^2\sqrt{SAT})$, where $S$ and $A$ are the numbers of states and actions. Finally, we propose an Approximate-IDS algorithm that is computationally more efficient while maintaining nearly the same sample efficiency. The design principle of this approximate algorithm is not only effective in RLHF settings but also applicable to the standard RL framework. Moreover, our work showcases the value of information theory in reinforcement learning and in the training of large language models.
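Illustration (not from the paper): a minimal sketch of how the tabular regret rate $H^2\sqrt{SAT}$ quoted in the abstract scales with the number of episodes $T$. Constants and the logarithmic factors hidden by $\tilde{O}$ are dropped, and the function name and the problem sizes are hypothetical choices made only for this example.

```python
import math


def tabular_regret_scale(H: int, S: int, A: int, T: int) -> float:
    """Return H^2 * sqrt(S * A * T).

    This is only the leading-order shape of the tabular bound quoted in
    the abstract; constants and log factors are ignored (assumption).
    """
    return H ** 2 * math.sqrt(S * A * T)


# Hypothetical problem sizes, chosen only for illustration.
H, S, A = 10, 20, 5

for T in (100, 10_000):
    total = tabular_regret_scale(H, S, A, T)
    # Average regret per episode shrinks like 1 / sqrt(T).
    print(f"T={T}: total scale={total:.0f}, per-episode scale={total / T:.1f}")
```

Because the rate grows like $\sqrt{T}$, running 100 times more episodes raises the total scale only 10-fold, so the per-episode average falls by a factor of 10.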

Related papers