Context-Based Soft Actor Critic for Environments with Non-stationary
Dynamics
- URL: http://arxiv.org/abs/2105.03310v2
- Date: Mon, 10 May 2021 09:25:53 GMT
- Title: Context-Based Soft Actor Critic for Environments with Non-stationary
Dynamics
- Authors: Yuan Pu, Shaochen Wang, Xin Yao, Bin Li
- Abstract summary: We propose the Latent Context-based Soft Actor Critic (LC-SAC) method to address the aforementioned issues.
By minimizing the contrastive prediction loss function, the learned context variables capture the information of the environment dynamics and the recent behavior of the agent.
Experimental results show that the performance of LC-SAC is significantly better than the SAC algorithm on the MetaWorld ML1 tasks.
- Score: 8.318823695156974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance of deep reinforcement learning methods is prone to
degrade when applied to environments with non-stationary dynamics. In this
paper, we utilize latent context recurrent encoders motivated by recent Meta-RL
work and propose the Latent Context-based Soft Actor Critic (LC-SAC) method to
address the aforementioned issue. By minimizing the contrastive prediction loss
function, the learned context variables capture the information of the
environment dynamics and the recent behavior of the agent. Then, combined with
the soft policy iteration paradigm, the LC-SAC method alternates between soft
policy evaluation and soft policy improvement until it converges to the optimal
policy. Experimental results show that LC-SAC performs significantly better
than the SAC algorithm on the MetaWorld ML1 tasks, whose dynamics change
drastically across episodes, and is comparable to SAC on the MuJoCo continuous
control benchmark tasks, whose dynamics change slowly or not at all between
episodes. In addition, we conduct experiments to determine the impact of
different hyperparameter settings on the performance of the LC-SAC algorithm
and give reasonable suggestions for hyperparameter settings.
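As a rough illustration of the context-encoding idea described in the abstract, the sketch below pairs a recurrent context encoder with an InfoNCE-style contrastive loss; the class and function names, network sizes, and the exact form of the contrastive prediction loss are assumptions rather than the paper's implementation. The SAC actor and critics would then condition on the learned context.

    # Minimal sketch of the latent-context idea behind LC-SAC (names and the
    # exact loss are assumptions; the paper's contrastive prediction loss may differ).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContextEncoder(nn.Module):
        """GRU over recent (s, a, r) transitions -> latent context z."""
        def __init__(self, obs_dim, act_dim, ctx_dim=16, hidden=64):
            super().__init__()
            self.gru = nn.GRU(obs_dim + act_dim + 1, hidden, batch_first=True)
            self.head = nn.Linear(hidden, ctx_dim)

        def forward(self, traj):              # traj: (B, T, obs_dim + act_dim + 1)
            _, h = self.gru(traj)
            return self.head(h[-1])           # (B, ctx_dim)

    def contrastive_prediction_loss(z, z_pos, temperature=0.1):
        """InfoNCE-style loss: contexts from the same trajectory segment are
        positives, other elements of the batch act as negatives (an assumption
        about how the contrastive objective is instantiated)."""
        z = F.normalize(z, dim=-1)
        z_pos = F.normalize(z_pos, dim=-1)
        logits = z @ z_pos.t() / temperature                  # (B, B) similarity matrix
        labels = torch.arange(z.size(0), device=z.device)     # diagonal = positives
        return F.cross_entropy(logits, labels)

    # The SAC actor and critics would take torch.cat([obs, z], dim=-1) as input,
    # and soft policy evaluation / improvement proceed as in standard SAC.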
Related papers
- Markov Balance Satisfaction Improves Performance in Strictly Batch Offline Imitation Learning [8.92571113137362]
We address a scenario where the imitator relies solely on observed behavior and cannot make environmental interactions during learning.
Unlike state-of-the-art (SOTA) IL methods, this approach tackles the limitations of conventional IL by operating in a more constrained and realistic setting.
We demonstrate consistently superior empirical performance compared to many SOTA IL algorithms.
arXiv Detail & Related papers (2024-08-17T07:17:19Z) - HarmoDT: Harmony Multi-Task Decision Transformer for Offline Reinforcement Learning [72.25707314772254]
We introduce the Harmony Multi-Task Decision Transformer (HarmoDT), a novel solution designed to identify an optimal harmony subspace of parameters for each task.
The upper level of this framework is dedicated to learning a task-specific mask that delineates the harmony subspace, while the inner level focuses on updating parameters to enhance the overall performance of the unified policy.
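A minimal sketch of the per-task masking idea described above, assuming a simple sigmoid gate over shared weights; HarmoDT itself operates on a Decision Transformer and learns the mask at the upper level of a bi-level scheme, so the class below is only illustrative.

    # Illustrative per-task parameter masking (not HarmoDT's exact bi-level
    # procedure): each task owns learnable mask logits over shared weights.
    import torch
    import torch.nn as nn

    class MaskedLinear(nn.Module):
        def __init__(self, in_dim, out_dim, num_tasks):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
            # One mask-logit tensor per task; squashed to a soft gate in [0, 1].
            self.mask_logits = nn.Parameter(torch.zeros(num_tasks, out_dim, in_dim))

        def forward(self, x, task_id):
            mask = torch.sigmoid(self.mask_logits[task_id])   # soft task-specific mask
            return x @ (self.weight * mask).t()

    # An outer loop would update mask_logits per task (the "harmony subspace"),
    # while an inner loop updates the shared weight of the unified policy.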
arXiv Detail & Related papers (2024-05-28T11:41:41Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Actor-Critic Reinforcement Learning with Phased Actor [10.577516871906816]
We propose a novel phased actor in actor-critic (PAAC) method to improve policy gradient estimation.
PAAC accounts for both $Q$ value and TD error in its actor update.
Results show that PAAC leads to significant performance improvement measured by total cost, learning variance, robustness, learning speed and success rate.
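The summary above only states that the actor update uses both the Q value and the TD error; the toy loss below blends the two with a phase weight as a placeholder for whatever schedule PAAC actually uses.

    # Illustrative actor loss mixing a Q-value term with a TD-error term
    # (a stand-in for PAAC's phased blending; the paper's schedule may differ).
    import torch

    def phased_actor_loss(q_value, td_error, log_prob, phase_weight):
        """phase_weight in [0, 1]: 0 -> purely Q-driven, 1 -> purely TD-driven.
        q_value: Q(s, a) for actions sampled from the current policy.
        td_error: detached TD error weighting the log-probability term."""
        q_term = -q_value.mean()                              # policy gradient via Q
        td_term = -(td_error.detach() * log_prob).mean()      # REINFORCE-style term
        return (1.0 - phase_weight) * q_term + phase_weight * td_term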
arXiv Detail & Related papers (2024-04-18T01:27:31Z) - Non-Parametric Stochastic Policy Gradient with Strategic Retreat for
Non-Stationary Environment [1.5229257192293197]
We propose a systematic methodology to learn a sequence of optimal control policies non-parametrically.
Our methodology outperforms the well-established DDPG and TD3 methods by a sizeable margin in terms of learning performance.
arXiv Detail & Related papers (2022-03-24T21:41:13Z) - Soft Actor-Critic with Cross-Entropy Policy Optimization [0.45687771576879593]
We propose Soft Actor-Critic with Cross-Entropy Policy Optimization (SAC-CEPO).
SAC-CEPO uses the Cross-Entropy Method (CEM) to optimize the policy network of SAC.
We show that SAC-CEPO achieves competitive performance against the original SAC.
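A bare-bones CEM loop of the kind SAC-CEPO applies to SAC's policy side, shown here as an action search against a critic q_fn; the function name, population sizes, and the regression target for the policy network are assumptions.

    # Minimal cross-entropy method (CEM) search for an action maximizing Q(s, a);
    # a simplified stand-in for how CEM could drive SAC's policy optimization.
    import torch

    def cem_action(q_fn, state, act_dim, iters=5, pop=64, elite_frac=0.1):
        mean, std = torch.zeros(act_dim), torch.ones(act_dim)
        n_elite = max(2, int(pop * elite_frac))
        for _ in range(iters):
            samples = mean + std * torch.randn(pop, act_dim)           # candidate actions
            scores = q_fn(state.expand(pop, -1), samples).squeeze(-1)  # critic scores
            elite = samples[scores.topk(n_elite).indices]              # keep the best
            mean, std = elite.mean(0), elite.std(0) + 1e-6             # refit the Gaussian
        return mean  # the policy network could then be regressed toward this action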
arXiv Detail & Related papers (2021-12-21T11:38:12Z) - Learning Robust Policy against Disturbance in Transition Dynamics via
State-Conservative Policy Optimization [63.75188254377202]
Deep reinforcement learning algorithms can perform poorly in real-world tasks due to the discrepancy between source and target environments.
We propose a novel model-free actor-critic algorithm to learn robust policies without modeling the disturbance in advance.
Experiments in several robot control tasks demonstrate that SCPO learns robust policies against the disturbance in transition dynamics.
arXiv Detail & Related papers (2021-12-20T13:13:05Z) - Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with
On-Policy Experience [9.06635747612495]
Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm.
SAC trains a policy by maximizing the trade-off between expected return and entropy.
It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks.
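For reference, the entropy-regularized objective that SAC maximizes is usually written as

    J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],

where \alpha is the temperature that trades off the expected return against the policy entropy \mathcal{H}.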
arXiv Detail & Related papers (2021-09-24T06:46:28Z) - Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that robust value iteration is more robust than deep reinforcement learning algorithms and the non-robust version of the algorithm.
arXiv Detail & Related papers (2021-05-25T19:48:35Z) - Policy Information Capacity: Information-Theoretic Measure for Task
Complexity in Deep Reinforcement Learning [83.66080019570461]
We propose two environment-agnostic, algorithm-agnostic quantitative metrics for task difficulty.
We show that these metrics have higher correlations with normalized task solvability scores than a variety of alternatives.
These metrics can also be used for fast and compute-efficient optimizations of key design parameters.
arXiv Detail & Related papers (2021-03-23T17:49:50Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)