Fugu-MT Paper Translation (Abstract): Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

Paper overview: Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

arxiv url: http://arxiv.org/abs/2210.01241v1
Date: Mon, 3 Oct 2022 21:38:29 GMT
Status: Translation complete
Last updated in system: 2022-10-05 13:23:27.223969
Title: Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
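Title (reference translation): Is Reinforcement Learning (Not) for Natural Language Processing? Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization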
Authors: Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, Yejin Choi
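Abstract summary: We introduce RL4LMs, an open-source modular library for optimizing language generation with reinforcement learning. Next, we present the GRUE benchmark, a set of six language generation tasks supervised by reward functions rather than target strings. Finally, we propose NLPO, an algorithm that effectively reduces the action space in language generation.
Reference score (independently computed attention measure): 73.74371798168642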
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization), that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al. 2017)), based on both automatic and human evaluation.
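Abstract (reference translation): We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If text generation is viewed as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability caused by the combinatorial action space and a lack of open-source libraries and benchmarks customized for LM alignment. Is RL a practical paradigm for NLP? To help answer this question, we first introduce RL4LMs (Reinforcement Learning for Language Models), an open-source modular library for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of six language generation tasks supervised not by target strings but by reward functions that capture automated measures of human preference. Finally, we propose NLPO (Natural Language Policy Optimization), an easy-to-use, high-performance RL algorithm that learns to effectively reduce the combinatorial action space in language generation. We show that 1) RL methods are generally better than supervised methods at aligning LMs with human preferences, and 2) NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al. 2017)), based on both automatic and human evaluation.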


Related papers