Fugu-MT Paper Translation (Abstract): On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers

arxiv url: http://arxiv.org/abs/2502.05672v1
Date: Sat, 08 Feb 2025 19:26:22 GMT
Status: Translation complete
Last updated in system: 2025-02-11 18:57:50.223997
Title: On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers
Authors: Miroslav Štrupl, Oleg Szehr, Francesco Faccio, Dylan R. Ashley, Rupesh Kumar Srivastava, Jürgen Schmidhuber
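Abstract summary: This article provides a rigorous analysis of the convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers.
Reference score (attention score computed independently by this site): 25.880499561355904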
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers. These algorithms performed competitively across various benchmarks, from games to robotic tasks, but their theoretical understanding is limited to specific environmental conditions. This work initiates a theoretical foundation for algorithms that build on the broad paradigm of approaching reinforcement learning through supervised learning or sequence modeling. At the core of this investigation lies the analysis of conditions on the underlying environment, under which the algorithms can identify optimal solutions. We also assess whether emerging solutions remain stable in situations where the environment is subject to tiny levels of noise. Specifically, we study the continuity and asymptotic convergence of command-conditioned policies, values and the goal-reaching objective depending on the transition kernel of the underlying Markov Decision Process. We demonstrate that near-optimal behavior is achieved if the transition kernel is located in a sufficiently small neighborhood of a deterministic kernel. The mentioned quantities are continuous (with respect to a specific topology) at deterministic kernels, both asymptotically and after a finite number of learning cycles. The developed methods allow us to present the first explicit estimates on the convergence and stability of policies and values in terms of the underlying transition kernels. On the theoretical side we introduce a number of new concepts to reinforcement learning, like working in segment spaces, studying continuity in quotient topologies and the application of the fixed-point theory of dynamical systems. The theoretical study is accompanied by a detailed investigation of example environments and numerical experiments.

Related papers

Reinforcement Learning in Switching Non-Stationary Markov Decision Processes: Algorithms and Convergence Analysis [6.399565088857091]
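We introduce Switching Non-Stationary Markov Decision Processes (SNS-MDP), in which the environment switches over time according to an underlying Markov chain. Under a fixed policy, the value function of an SNS-MDP admits a closed-form solution determined by the statistical properties of the Markov chain. We show how this framework can effectively guide decision-making in complex, time-varying contexts.
Reference translation (metadata) (2025-03-24T12:05:30Z)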
Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
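Policy gradient (PG) methods are successful approaches to continuous reinforcement learning (RL) problems. In common practice, a (hyper)policy is learned only so that its deterministic version can be deployed. This paper shows how to tune the exploration level used for learning so as to optimize the trade-off between sample complexity and the performance of the deployed deterministic policy.
Reference translation (metadata) (2024-05-03T16:45:15Z)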
Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments [1.90365714903665]
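We present a convergence theorem for iterations, and in particular for Q-learning iterates, under a general, possibly non-Markovian environment. We discuss the implications and applications of this theorem to a variety of control problems in non-Markovian environments.
Reference translation (metadata) (2023-10-31T19:53:16Z)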
Probabilistic Reach-Avoid for Bayesian Neural Networks [71.67052234622781]
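We show that the optimal synthesis algorithm can increase the number of certified states by more than a factor of four. The algorithm can also improve the average reach-avoid probability by more than a factor of three.
Reference translation (metadata) (2023-10-03T10:52:21Z)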
Conditional Kernel Imitation Learning for Continuous State Environments [9.750698192309978]
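We propose a new imitation learning framework based on conditional kernel density estimation. We show consistently superior empirical performance over many state-of-the-art IL algorithms.
Reference translation (metadata) (2023-08-24T05:26:42Z)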
Can Decentralized Stochastic Minimax Optimization Algorithms Converge Linearly for Finite-Sum Nonconvex-Nonconcave Problems? [56.62372517641597]
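Decentralized minimax optimization has been actively studied in recent years because of its wide range of applications in machine learning. This paper proposes two new decentralized minimax optimization algorithms for nonconvex-nonconcave problems.
Reference translation (metadata) (2023-04-24T02:19:39Z)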
Performative Reinforcement Learning [8.07595093287034]
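We introduce the notion of a performatively stable policy. We show that repeatedly optimizing this objective converges to a performatively stable policy.
Reference translation (metadata) (2022-06-30T18:26:03Z)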
Towards Robust Bisimulation Metric Learning [3.42658286826597]
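Bisimulation metrics offer one solution to the representation learning problem. We generalize value function approximation bounds for on-policy bisimulation metrics to non-optimal policies. These issues stem from an underconstrained dynamics model and an unstable dependence of the embedding norm on the reward signal.
Reference translation (metadata) (2021-10-27T00:32:07Z)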
Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
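Reinforcement learning considers the problem of finding policies that maximize the expected cumulative reward in a Markov decision process. We compute unbiased stochastic gradients of the value function, which we use as ascent directions to update the policy. A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
Reference translation (metadata) (2020-10-16T15:15:42Z)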
A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms [67.67377846416106]
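We present a distributional approach to the theoretical analysis of reinforcement learning algorithms with constant step sizes. We show that value-based methods such as TD($\lambda$) and $Q$-Learning have update rules that are contractive in the space of distributions of functions.
Reference translation (metadata) (2020-03-27T05:13:29Z)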
Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives [97.16266088683061]
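This work rigorously proves that symplectic discretization schemes are important for momentum-based optimization algorithms. It provides a characterization of algorithms that exhibit accelerated convergence.
Reference translation (metadata) (2020-02-28T00:32:47Z)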

The related-papers list is generated automatically from the titles and abstracts of papers on this site.