Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous
Control with Discrete RL
- URL: http://arxiv.org/abs/2308.10203v1
- Date: Sun, 20 Aug 2023 08:32:11 GMT
- Title: Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous
Control with Discrete RL
- Authors: Yechen Zhang, Jian Sun, Gang Wang, Zhuo Li, Wei Chen
- Abstract summary: We present the Soft Decomposed Policy-Critic (SDPC) architecture, which combines soft RL and actor-critic techniques with discrete RL methods to overcome this limitation.
SDPC discretizes each action dimension independently and employs a shared critic network to maximize the soft $Q$-function.
Our proposed approach outperforms state-of-the-art continuous RL algorithms in a variety of continuous control tasks, including Mujoco's Humanoid and Box2d's BipedalWalker.
- Score: 47.80205106726076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Discrete reinforcement learning (RL) algorithms have demonstrated exceptional
performance in solving sequential decision tasks with discrete action spaces,
such as Atari games. However, their effectiveness is hindered when applied to
continuous control problems due to the challenge of dimensional explosion. In
this paper, we present the Soft Decomposed Policy-Critic (SDPC) architecture,
which combines soft RL and actor-critic techniques with discrete RL methods to
overcome this limitation. SDPC discretizes each action dimension independently
and employs a shared critic network to maximize the soft $Q$-function. This
novel approach enables SDPC to support two types of policies: decomposed actors
that lead to the Soft Decomposed Actor-Critic (SDAC) algorithm, and decomposed
$Q$-networks that generate Boltzmann soft exploration policies, resulting in
the Soft Decomposed-Critic Q (SDCQ) algorithm. Through extensive experiments,
we demonstrate that our proposed approach outperforms state-of-the-art
continuous RL algorithms in a variety of continuous control tasks, including
Mujoco's Humanoid and Box2d's BipedalWalker. These empirical results validate
the effectiveness of the SDPC architecture in addressing the challenges
associated with continuous control.
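As a rough illustration of the mechanism described above, the sketch below shows independent per-dimension discretization with a decomposed Q-network and a Boltzmann soft exploration policy (the SDCQ-style policy). It is a minimal sketch, not the authors' implementation: the hidden sizes, bin count, temperature, and the 24/4-dimensional example are illustrative assumptions, and the shared soft critic and training loop are omitted.

```python
# Hedged sketch (not the authors' code): independent per-dimension discretization
# with a decomposed Q-network and a Boltzmann soft exploration policy, mirroring
# the SDCQ-style policy described in the abstract. Hidden sizes, bin count, and
# temperature are illustrative assumptions.
import torch
import torch.nn as nn

class DecomposedQNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim, bins=11, hidden=256):
        super().__init__()
        self.act_dim, self.bins = act_dim, bins
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One head of `bins` Q-values per action dimension: D*K outputs, not K^D.
        self.head = nn.Linear(hidden, act_dim * bins)

    def forward(self, obs):
        q = self.head(self.trunk(obs))              # (batch, act_dim * bins)
        return q.view(-1, self.act_dim, self.bins)  # (batch, act_dim, bins)

def boltzmann_action(q_net, obs, temperature=0.1, low=-1.0, high=1.0):
    """Sample every action dimension from a softmax over its own bin Q-values,
    then map the sampled bin indices back to continuous values in [low, high]."""
    logits = q_net(obs) / temperature                               # (batch, act_dim, bins)
    idx = torch.distributions.Categorical(logits=logits).sample()   # (batch, act_dim)
    grid = torch.linspace(low, high, q_net.bins, device=logits.device)
    return grid[idx]                                                # continuous actions

# Example sized like BipedalWalker (24-dim observation, 4-dim action); purely illustrative.
q_net = DecomposedQNetwork(obs_dim=24, act_dim=4)
action = boltzmann_action(q_net, torch.randn(1, 24))
```

Because each dimension keeps its own K-way head, the network outputs act_dim * bins values rather than the bins^act_dim joint actions a naive discretization would need, which is the dimensional explosion the abstract refers to.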
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Distributionally Robust Constrained Reinforcement Learning under Strong Duality [37.76993170360821]
We study the problem of Distributionally Robust Constrained RL (DRC-RL).
The goal is to maximize the expected reward subject to environmental distribution shifts and constraints.
We develop an algorithmic framework based on strong duality that enables the first efficient and provable solution.
arXiv Detail & Related papers (2024-06-22T08:51:57Z)
- Two-Stage ML-Guided Decision Rules for Sequential Decision Making under Uncertainty [55.06411438416805]
Sequential Decision Making under Uncertainty (SDMU) is ubiquitous in many domains such as energy, finance, and supply chains.
Some SDMU problems are naturally modeled as Multistage Problems (MSPs), but the resulting optimizations are notoriously challenging from a computational standpoint.
This paper introduces a novel approach Two-Stage General Decision Rules (TS-GDR) to generalize the policy space beyond linear functions.
The effectiveness of TS-GDR is demonstrated through an instantiation using Deep Recurrent Neural Networks named Two-Stage Deep Decision Rules (TS-LDR).
arXiv Detail & Related papers (2024-05-23T18:19:47Z)
- TASAC: a twin-actor reinforcement learning framework with stochastic policy for batch process control [1.101002667958165]
Reinforcement Learning (RL), wherein an agent learns the policy by directly interacting with the environment, offers a potential alternative in this context.
RL frameworks with actor-critic architecture have recently become popular for controlling systems where state and action spaces are continuous.
It has been shown that an ensemble of actor and critic networks further helps the agent learn better policies, owing to the enhanced exploration enabled by simultaneous policy learning.
arXiv Detail & Related papers (2022-04-22T13:00:51Z)
- Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z)
- OPAC: Opportunistic Actor-Critic [0.0]
Opportunistic Actor-Critic (OPAC) is a novel model-free deep RL algorithm that employs a better exploration policy and achieves lower variance.
OPAC combines some of the most powerful features of TD3 and SAC and aims to optimize a policy in an off-policy way.
arXiv Detail & Related papers (2020-12-11T18:33:35Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
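A rough, hedged sketch of how those two ingredients can be realized follows; the tensor shapes, the sigmoid-based weighting form, and the `temperature`/`lam` hyperparameters are assumptions for illustration rather than details taken from the entry above.

```python
# Illustrative sketch of SUNRISE's two ingredients under assumed shapes:
# (a) down-weight Bellman targets whose ensemble disagreement (std) is high,
# (b) pick candidate actions by an upper-confidence bound over the Q-ensemble.
import torch

def weighted_bellman_weights(target_q_ensemble, temperature=10.0):
    """target_q_ensemble: (num_ensemble, batch) Q-targets from the ensemble.
    Returns per-sample weights in (0.5, 1.0); higher uncertainty -> lower weight."""
    std = target_q_ensemble.std(dim=0)             # ensemble disagreement per sample
    return torch.sigmoid(-std * temperature) + 0.5

def ucb_action(candidate_q_ensemble, lam=1.0):
    """candidate_q_ensemble: (num_ensemble, num_candidates) Q-values of candidate actions.
    Returns the index of the candidate with the highest mean + lam * std."""
    mean = candidate_q_ensemble.mean(dim=0)
    std = candidate_q_ensemble.std(dim=0)
    return torch.argmax(mean + lam * std)

# Example: a 5-member Q-ensemble scoring 8 candidate actions and 32 Bellman targets.
best_idx = ucb_action(torch.randn(5, 8))
weights = weighted_bellman_weights(torch.randn(5, 32))
```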
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
- Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations [88.94162416324505]
A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises.
Since the observations deviate from the true states, they can mislead the agent into taking suboptimal actions.
We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks.
arXiv Detail & Related papers (2020-03-19T17:59:59Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action spaces is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine the critic-estimated action values to control the variance of gradient estimation (see the sketch after this entry).
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
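A hedged sketch of the general variance-control idea in the last entry above: use critic-estimated Q-values over all discrete actions to form a policy-weighted baseline and subtract it from the chosen action's value. This illustrates one standard way to combine critic-estimated action values for variance reduction, not necessarily the paper's exact estimator; shapes and names are assumptions.

```python
# Hedged sketch: a variance-reduced discrete-action policy-gradient loss that
# subtracts the policy-weighted mean of the critic's Q-values as a baseline.
import torch

def low_variance_pg_loss(logits, q_values, actions):
    """logits:   (batch, num_actions) policy logits
       q_values: (batch, num_actions) critic estimates Q(s, a) for every action
       actions:  (batch,) indices of the sampled actions"""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    baseline = (probs * q_values).sum(dim=-1, keepdim=True)  # E_{a~pi}[Q(s, a)]
    advantage = (q_values - baseline).detach()                # centered, lower-variance signal
    chosen_logp = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    chosen_adv = advantage.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(chosen_logp * chosen_adv).mean()                 # minimizing this ascends the PG

# Example: batch of 16 states with 6 discrete actions; random values for illustration.
loss = low_variance_pg_loss(torch.randn(16, 6, requires_grad=True),
                            torch.randn(16, 6),
                            torch.randint(0, 6, (16,)))
loss.backward()
```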
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.