Fugu-MT Paper Translation (Abstract): Effects of sparse rewards of different magnitudes in the speed of learning of model-based actor critic methods

Paper overview: Effects of sparse rewards of different magnitudes in the speed of learning of model-based actor critic methods

arxiv url: http://arxiv.org/abs/2001.06725v1
Date: Sat, 18 Jan 2020 20:52:05 GMT
Status: Translation complete
Last updated in system: 2023-01-10 05:14:13.407975
Title: Effects of sparse rewards of different magnitudes in the speed of learning of model-based actor critic methods
Title (reference translation): Effect of sparse rewards of different magnitudes on the learning speed of model-based actor-critic methods
Authors: Juan Vargas, Lazar Andjelic, Amir Barati Farimani
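Abstract summary: Applying an external environmental pressure during training can make an agent learn faster. The results are shown to hold for Deep Deterministic Policy Gradients using Hindsight Experience Replay in a well-known Mujoco environment.
Reference score (independently computed attention score): 0.4640835690336653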
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Actor critic methods with sparse rewards in model-based deep reinforcement learning typically require a deterministic binary reward function that reflects only two possible outcomes: if, for each step, the goal has been achieved or not. Our hypothesis is that we can influence an agent to learn faster by applying an external environmental pressure during training, which adversely impacts its ability to get higher rewards. As such, we deviate from the classical paradigm of sparse rewards and add a uniformly sampled reward value to the baseline reward to show that (1) sample efficiency of the training process can be correlated to the adversity experienced during training, (2) it is possible to achieve higher performance in less time and with less resources, (3) we can reduce the performance variability experienced seed over seed, (4) there is a maximum point after which more pressure will not generate better results, and (5) that random positive incentives have an adverse effect when using a negative reward strategy, making an agent under those conditions learn poorly and more slowly. These results have been shown to be valid for Deep Deterministic Policy Gradients using Hindsight Experience Replay in a well known Mujoco environment, but we argue that they could be generalized to other methods and environments as well.
Abstract (reference translation): In model-based deep reinforcement learning, actor-critic methods with sparse rewards typically require a deterministic binary reward function that reflects only two possible outcomes: whether or not the goal has been achieved at each step. Our hypothesis is that an agent can be made to learn faster by applying an external environmental pressure during training that adversely affects its ability to obtain higher rewards. As such, we deviate from the classical paradigm of sparse rewards and add a uniformly sampled reward value to the baseline reward to show that (1) sample efficiency of the training process can be correlated to the adversity experienced during training, (2) it is possible to achieve higher performance in less time and with less resources, (3) we can reduce the performance variability experienced seed over seed, (4) there is a maximum point after which more pressure will not generate better results, and (5) that random positive incentives have an adverse effect when using a negative reward strategy, making an agent under those conditions learn poorly and more slowly. These results have been shown to be valid for Deep Deterministic Policy Gradients using Hindsight Experience Replay in a well-known Mujoco environment, but we argue that they could be generalized to other methods and environments as well.
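
The reward modification described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. The baseline convention (0 on success, -1 otherwise), the function names `sparse_reward` and `pressured_reward`, the `magnitude` parameter, and the choice to subtract a value drawn uniformly from [0, magnitude] at every step are all assumptions made for the example; the abstract does not state the actual ranges or where the sampled value is applied.

```python
import random


def sparse_reward(goal_achieved: bool) -> float:
    """Baseline binary sparse reward.

    Assumed convention: 0 when the goal is achieved at this step, -1 otherwise.
    """
    return 0.0 if goal_achieved else -1.0


def pressured_reward(goal_achieved: bool, magnitude: float, rng: random.Random) -> float:
    """Baseline reward plus a uniformly sampled negative "pressure" term.

    Assumption: the pressure is drawn from [0, magnitude] and subtracted at
    every step, whether or not the goal was achieved.
    """
    if magnitude < 0:
        raise ValueError("magnitude must be non-negative")
    pressure = -rng.uniform(0.0, magnitude)
    return sparse_reward(goal_achieved) + pressure


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for achieved in (False, True):
        rewards = [pressured_reward(achieved, 0.5, rng) for _ in range(3)]
        print(achieved, rewards)
```

Setting `magnitude` to 0 recovers the plain binary reward, which makes it straightforward to compare runs with and without pressure under the same seed.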

Related papers

Reducing Reward Dependence in RL Through Adaptive Confidence Discounting [0.0]
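We present a new reinforcement learning algorithm that requests rewards only when its knowledge of the value of actions in an environment state is low. By reducing dependence on expensive rewards, it can learn efficiently in settings where the logistics or cost of obtaining rewards would otherwise be prohibitive.
Reference translation (metadata) (2025-02-28T15:58:21Z)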
DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing [60.21269454707625]
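DreamSmooth learns to predict a temporally smoothed reward rather than the exact reward at a given timestep. We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
Reference translation (metadata) (2023-11-02T17:57:38Z)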
The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training [72.39526433794707]
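Adversarial training and its variants have been shown to be the most effective approaches for defending against adversarial examples. This paper proposes a novel adversarial training scheme that encourages the model to produce similar outputs. The method achieves state-of-the-art robustness as well as natural accuracy.
Reference translation (metadata) (2022-11-01T15:24:26Z)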
Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
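We propose a distributional reward estimation framework for effective multi-agent reinforcement learning (DRE-MARL). The aim is to design multi-action-branch reward estimation and policy-weighted reward aggregation for stable training. The superiority of DRE-MARL over SOTA baselines is demonstrated on benchmark multi-agent scenarios in terms of both effectiveness and robustness.
Reference translation (metadata) (2022-10-14T08:31:45Z)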
Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning [17.360622968442982]
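We present a novel intrinsic reward, inspired by human learning, that evaluates curiosity by comparing current observations with historical knowledge. The method involves training a self-supervised prediction model, saving snapshots of the model parameters, and using the nuclear norm to evaluate the temporal inconsistency between the predictions of different snapshots as the intrinsic reward.
Reference translation (metadata) (2022-08-24T08:19:41Z)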
Imitating Past Successes can be Very Suboptimal [145.70788608016755]
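We show that existing outcome-conditioned imitation learning methods do not necessarily improve the policy. We show that a simple modification yields a method that does guarantee policy improvement. Our aim is not to develop an entirely new method, but to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
Reference translation (metadata) (2022-06-07T15:13:43Z)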
Causal Confusion and Reward Misidentification in Preference-Based Reward Learning [33.944367978407904]
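We study causal confusion and reward misidentification when learning from preferences. We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification.
Reference translation (metadata) (2022-04-13T18:41:41Z)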
Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
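IMPLANT, which plans at test time, is a new meta-algorithm for imitation learning. IMPLANT is shown to outperform benchmark imitation learning methods in standard control environments.
Reference translation (metadata) (2022-04-07T17:16:52Z)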
SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
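We propose SURF, a semi-supervised reward learning framework that uses a large number of unlabeled samples together with data augmentation. To leverage unlabeled samples for reward learning, pseudo-labels for the unlabeled samples are inferred based on the confidence of the preference predictor. Experiments demonstrate that the approach significantly improves the feedback efficiency of preference-based methods on robotic manipulation tasks.
Reference translation (metadata) (2022-03-18T16:50:38Z)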
Imitation Learning by State-Only Distribution Matching [2.580765958706854]
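Imitation learning from observation describes policy learning in a way similar to human learning. This paper proposes a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
Reference translation (metadata) (2022-02-09T08:38:50Z)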
Combating False Negatives in Adversarial Imitation Learning [67.99941805086154]
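In adversarial imitation learning, a discriminator is trained to distinguish agent episodes from expert demonstrations that represent the desired behavior. As the trained policy learns to be more successful, the negative examples become increasingly similar to the expert examples. This work proposes a method to mitigate the effect of false negatives and validates it on the BabyAI environment.
Reference translation (metadata) (2020-02-02T14:56:39Z)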

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

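This page shows information on the specified paper.
The operator of this site does not guarantee the quality of the site (including all information and translations) and accepts no responsibility for any results arising from its use.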