Fugu-MT Paper Translation (Abstract): Continuous-Time Fitted Value Iteration for Robust Policies

Paper summary: Continuous-Time Fitted Value Iteration for Robust Policies

arxiv url: http://arxiv.org/abs/2110.01954v1
Date: Tue, 5 Oct 2021 11:33:37 GMT
Status: Translation complete
System update date: 2021-10-06 13:59:34.993080
Title: Continuous-Time Fitted Value Iteration for Robust Policies
Title (reference translation): Continuous-Time Fitted Value Iteration for Robust Policies
Authors: Michael Lutter, Boris Belousov, Shie Mannor, Dieter Fox, Animesh Garg, Jan Peters
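Abstract summary: Solving the Hamilton-Jacobi-Bellman equation is important in many domains, including control, robotics, and economics. The paper proposes continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI). These algorithms exploit the non-linear control-affine dynamics and the separable state and action rewards of many continuous control problems.
Reference score (independently computed attention score): 93.25997466553929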
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Solving the Hamilton-Jacobi-Bellman equation is important in many domains including control, robotics and economics. Especially for continuous control, solving this differential equation and its extension, the Hamilton-Jacobi-Isaacs equation, is important as it yields the optimal policy that achieves the maximum reward on a given task. In the case of the Hamilton-Jacobi-Isaacs equation, which includes an adversary controlling the environment and minimizing the reward, the obtained policy is also robust to perturbations of the dynamics. In this paper we propose continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI). These algorithms leverage the non-linear control-affine dynamics and separable state and action reward of many continuous control problems to derive the optimal policy and optimal adversary in closed form. This analytic expression simplifies the differential equations and enables us to solve for the optimal value function using value iteration for continuous actions and states as well as the adversarial case. Notably, the resulting algorithms do not require discretization of states or actions. We apply the resulting algorithms to the Furuta pendulum and cartpole. We show that both algorithms obtain the optimal policy. The robustness Sim2Real experiments on the physical systems show that the policies successfully achieve the task in the real world. When changing the masses of the pendulum, we observe that robust value iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm. Videos of the experiments are shown at https://sites.google.com/view/rfvi
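Abstract (reference translation): Solving the Hamilton-Jacobi-Bellman equation is important in many fields, including control, robotics, and economics. For continuous control in particular, this differential equation and its extension, the Hamilton-Jacobi-Isaacs equation, matter because they yield the optimal policy that achieves the maximum reward on a given task. For the Hamilton-Jacobi-Isaacs equation, which includes an adversary that controls the environment and minimizes the reward, the resulting policy is also robust to perturbations of the dynamics. This paper proposes continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI). These algorithms use the non-linear control-affine dynamics and the separable state and action rewards of many continuous control problems to derive the optimal policy and the optimal adversary in closed form. This analytic expression simplifies the differential equations and makes it possible to solve for the optimal value function with value iteration over continuous actions and states, including the adversarial case. Notably, the resulting algorithms require no discretization of states or actions. The algorithms are applied to the Furuta pendulum and the cartpole, and both are shown to obtain the optimal policy. Sim2Real robustness experiments on the physical systems show that the policies accomplish the task in the real world. When the masses of the pendulum are changed, robust value iteration is observed to be more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm. Videos of the experiments are available at https://sites.google.com/view/rfvi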
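
The abstract notes that control-affine dynamics and a separable reward allow the optimal policy to be derived in closed form. As a minimal sketch of why, assume dynamics of the form dx/dt = a(x) + B(x) u and a reward r(x, u) = q(x) - 0.5 u^T R u with R positive definite. The quadratic action cost, the symbols a, B, q, R, and all function names below are assumptions made for illustration; the abstract does not specify them. Under these assumptions, the action-dependent part of the HJB right-hand side is grad_V(x)^T B(x) u - 0.5 u^T R u, which is concave in u and is maximized at u* = R^{-1} B(x)^T grad_V(x). The snippet computes this action and checks numerically that random perturbations do not score higher.

```python
import numpy as np


def optimal_action(grad_v, B, R):
    """Closed-form maximizer of  grad_v^T B u - 0.5 u^T R u  over u.

    Setting the gradient B^T grad_v - R u to zero gives
    u* = R^{-1} B^T grad_v (valid for positive-definite R).
    """
    return np.linalg.solve(R, B.T @ grad_v)


def action_objective(u, grad_v, B, R):
    """Action-dependent part of the HJB right-hand side (hypothetical helper)."""
    return grad_v @ (B @ u) - 0.5 * u @ (R @ u)


# Illustrative numbers only; none of these come from the paper.
B = np.array([[0.0], [1.0]])      # control matrix B(x) at some state x
R = np.array([[0.5]])             # positive-definite action-cost weight
grad_v = np.array([0.3, -1.2])    # gradient of the value function at x

u_star = optimal_action(grad_v, B, R)
best = action_objective(u_star, grad_v, B, R)

# Score 100 random perturbations of u_star for comparison.
rng = np.random.default_rng(0)
perturbed = [
    action_objective(u_star + rng.normal(size=1), grad_v, B, R)
    for _ in range(100)
]
print(u_star, best, max(perturbed) <= best)
```

Because the maximizing action is available analytically, no search or discretization over actions is needed at this step, which is consistent with the abstract's statement that the algorithms do not discretize states or actions. A non-quadratic action cost would change the formula, so this should be read as one special case rather than the paper's general result.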

Related papers