Fugu-MT Paper Translation (Abstract): Towards Theoretical Understanding of Inverse Reinforcement Learning

Paper summary: Towards Theoretical Understanding of Inverse Reinforcement Learning

arxiv url: http://arxiv.org/abs/2304.12966v1
Date: Tue, 25 Apr 2023 16:21:10 GMT
Status: Translation complete
Last updated in system: 2023-04-26 19:48:31.701131
Title: Towards Theoretical Understanding of Inverse Reinforcement Learning
Authors: Alberto Maria Metelli, Filippo Lazzati, Marcello Restelli
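Abstract summary: Inverse reinforcement learning (IRL) denotes a powerful family of algorithms for recovering a reward function justifying the behavior demonstrated by an expert agent. This paper makes a step towards closing the theory gap of IRL in the case of finite-horizon problems with a generative model.
Reference score (attention score computed by this site): 45.3190496371625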
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inverse reinforcement learning (IRL) denotes a powerful family of algorithms for recovering a reward function justifying the behavior demonstrated by an expert agent. A well-known limitation of IRL is the ambiguity in the choice of the reward function, due to the existence of multiple rewards that explain the observed behavior. This limitation has been recently circumvented by formulating IRL as the problem of estimating the feasible reward set, i.e., the region of the rewards compatible with the expert's behavior. In this paper, we make a step towards closing the theory gap of IRL in the case of finite-horizon problems with a generative model. We start by formally introducing the problem of estimating the feasible reward set and the corresponding PAC requirement, and by discussing the properties of particular classes of rewards. Then, we provide the first minimax lower bound on the sample complexity for the problem of estimating the feasible reward set, of order $\Omega\Bigl( \frac{H^3SA}{\epsilon^2} \bigl( \log \bigl(\frac{1}{\delta}\bigr) + S \bigr)\Bigr)$, where $S$ and $A$ are the numbers of states and actions, respectively, $H$ is the horizon, $\epsilon$ is the desired accuracy, and $\delta$ is the confidence. We analyze the sample complexity of a uniform sampling strategy (US-IRL), proving a matching upper bound up to logarithmic factors. Finally, we outline several open questions in IRL and propose future research directions.
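To make the notion of a feasible reward set concrete, the following minimal Python sketch checks whether a candidate reward is compatible with an expert's behavior in a small tabular finite-horizon MDP, using backward induction. The function name `expert_is_optimal`, the toy two-state MDP, the deterministic expert policy, and the stage-independent transition model are illustrative assumptions and are not taken from the paper.

```python
def expert_is_optimal(P, r, expert_policy, H, tol=1e-9):
    """Return True if the expert's action is optimal at every stage and state.

    Hypothetical conventions for this sketch:
      P[s][a][s2]         probability of moving from s to s2 under action a
      r[h][s][a]          reward at stage h for taking action a in state s
      expert_policy[h][s] the expert's (deterministic) action
    """
    S, A = len(P), len(P[0])
    V = [0.0] * S  # optimal value after the final stage is zero
    for h in reversed(range(H)):
        # Optimal action values at stage h, given optimal values at stage h + 1.
        Q = [[r[h][s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
        for s in range(S):
            if Q[s][expert_policy[h][s]] < max(Q[s]) - tol:
                return False  # the expert's action is strictly suboptimal here
        V = [max(Q[s]) for s in range(S)]
    return True


H = 2
# Two states, two actions: action 0 stays in place, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
# The expert always heads for (or stays in) state 1.
expert = [[1, 0] for _ in range(H)]

reward_state1 = [[[float(s == 1)] * 2 for s in range(2)] for _ in range(H)]
reward_zero = [[[0.0] * 2 for _ in range(2)] for _ in range(H)]
reward_state0 = [[[float(s == 0)] * 2 for s in range(2)] for _ in range(H)]

print(expert_is_optimal(P, reward_state1, expert, H))
print(expert_is_optimal(P, reward_zero, expert, H))
print(expert_is_optimal(P, reward_state0, expert, H))
```

In this toy example, both the reward that pays for being in state 1 and the all-zero reward are compatible with the expert, while the reward that pays for being in state 0 is not. The first two illustrate the ambiguity described in the abstract: several distinct rewards explain the same behavior, which is why the feasible set, rather than a single reward, is the object being estimated.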
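As a rough illustration of how the lower bound stated in the abstract scales, the sketch below evaluates the expression $\frac{H^3SA}{\epsilon^2}\bigl(\log\frac{1}{\delta} + S\bigr)$ with all constants dropped. The function name and the parameter values are assumptions chosen for illustration; the result indicates an order of magnitude and how it grows, not an actual number of samples.

```python
import math


def feasible_set_lower_bound_order(H, S, A, epsilon, delta):
    """Evaluate H^3 * S * A / epsilon^2 * (log(1/delta) + S), constants dropped."""
    if not (0.0 < delta < 1.0) or epsilon <= 0.0:
        raise ValueError("need 0 < delta < 1 and epsilon > 0")
    return (H ** 3) * S * A / (epsilon ** 2) * (math.log(1.0 / delta) + S)


# Hypothetical problem sizes, chosen only to show the scaling.
base = feasible_set_lower_bound_order(H=10, S=20, A=5, epsilon=0.1, delta=0.05)
double_h = feasible_set_lower_bound_order(H=20, S=20, A=5, epsilon=0.1, delta=0.05)
half_eps = feasible_set_lower_bound_order(H=10, S=20, A=5, epsilon=0.05, delta=0.05)

print(f"baseline order:        {base:.3e}")
print(f"horizon doubled:       x{double_h / base:.1f}")  # cubic dependence on H
print(f"accuracy level halved: x{half_eps / base:.1f}")  # inverse-quadratic in epsilon
```

Doubling the horizon multiplies the expression by eight, and halving $\epsilon$ multiplies it by four, which matches the $H^3$ and $1/\epsilon^2$ factors in the bound.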

Related papers

Reasoning without Regret [4.07926531936425]
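This paper introduces Backwards Adaptive Reward Shaping (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals. Building on generic chaining, continuous scaling limits, and nonlinear Feynman-Kac bounds, the analysis connects the recent empirical successes of outcome-based methods with the benefits of intermediate supervision.
Paper reference translation (metadata) (2025-04-14T00:34:20Z)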
Partial Identifiability and Misspecification in Inverse Reinforcement Learning [64.13583792391783]
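The aim of inverse reinforcement learning (IRL) is to infer a reward function $R$ from a policy $\pi$. This paper gives a comprehensive analysis of partial identifiability and misspecification in IRL.
Paper reference translation (metadata) (2024-11-24T18:35:46Z)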
Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
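A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood. This paper proposes ValueWalk, a new Markov chain Monte Carlo method based on this insight.
Paper reference translation (metadata) (2024-07-15T17:59:52Z)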
Uncertainty-Aware Reward-Free Exploration with General Function Approximation [69.27868448449755]
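This paper proposes a reward-free reinforcement learning algorithm called GFA-RFE. The key idea behind the algorithm is an uncertainty-aware intrinsic reward for exploring the environment. Experiments show that GFA-RFE outperforms or is comparable to state-of-the-art unsupervised RL algorithms.
Paper reference translation (metadata) (2024-06-24T01:37:18Z)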
Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
Our algorithm achieves $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping that linearly parametrizes the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
Paper reference translation (metadata) (2023-05-15T05:37:32Z)
Provably Feedback-Efficient Reinforcement Learning via Active Reward Learning [26.067411894141863]
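The reward function is paramount in specifying the task in reinforcement learning (RL). Human-in-the-loop (HiL) RL allows humans to communicate complex goals to the RL agent by providing various types of feedback. We propose an active-learning-based RL algorithm that explores the environment without a specified reward function.
Paper reference translation (metadata) (2023-04-18T12:36:09Z)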
Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning [17.239062061431646]
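This paper studies reward-agnostic exploration in reinforcement learning (RL). We consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$. Our algorithm can attain $\varepsilon$ accuracy for an arbitrary number of reward functions.
Paper reference translation (metadata) (2023-04-14T17:46:49Z)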
Fast Rates for Maximum Entropy Exploration [52.946307632704645]
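We address the challenge of exploration in reinforcement learning (RL) when the agent operates in an unknown environment and receives no rewards. This work studies the maximum entropy exploration problem of two different types. For visitation entropy, we propose a game-theoretic algorithm with $\widetilde{\mathcal{O}}(H^3S^2A/\varepsilon^2)$ sample complexity. For trajectory entropy, we propose a simple algorithm with sample complexity of order $\widetilde{\mathcal{O}}(\mathrm{poly}(S, \ldots))$.
Paper reference translation (metadata) (2023-03-14T16:51:14Z)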
Variance-aware robust reinforcement learning with linear function approximation with heavy-tailed rewards [6.932056534450556]
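We propose two algorithms, AdaOFUL and VARA, for online sequential decision-making in the presence of heavy-tailed rewards. AdaOFUL achieves a state-of-the-art regret bound of $\widetilde{\mathcal{O}}\big(\ldots\big)$. VARA achieves a tighter variance-aware regret bound of $\widetilde{\mathcal{O}}(d\sqrt{H\mathcal{G}^*K})$.
Paper reference translation (metadata) (2023-03-09T22:16:28Z)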
Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
This paper studies episodic reinforcement learning in reward-mixing Markov decision processes (MDPs). An $\epsilon$-optimal policy is obtained after $\tilde{O}(\mathrm{poly}(H,\epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S$ and $A$ are the numbers of states and actions.
Paper reference translation (metadata) (2021-10-07T18:55:49Z)
A Lower Bound for the Sample Complexity of Inverse Reinforcement Learning [26.384010313580596]
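Inverse reinforcement learning (IRL) is the task of finding a reward function that generates a desired optimal policy for a given Markov decision process (MDP). This paper presents an information-theoretic lower bound on the sample complexity of the finite-state, finite-action IRL problem.
Paper reference translation (metadata) (2021-03-07T20:29:10Z)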
Adaptive Reward-Free Exploration [48.98199700043158]
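The proposed algorithm can be seen as a variant of Fiechter's algorithm from 1994. We also study the relative complexity of reward-free exploration and best-policy identification.
Paper reference translation (metadata) (2020-06-11T09:58:03Z)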
Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
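We propose a "reward-free RL" framework that isolates the challenges of exploration. We give an algorithm that explores efficiently using $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration. We also give a nearly matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of the algorithm in this setting.
Paper reference translation (metadata) (2020-02-07T14:03:38Z)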

The related-papers list is generated automatically from the titles and abstracts of papers on this site.

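The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.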