Quasi-optimal Reinforcement Learning with Continuous Actions
- URL: http://arxiv.org/abs/2301.08940v2
- Date: Mon, 2 Oct 2023 00:55:29 GMT
- Title: Quasi-optimal Reinforcement Learning with Continuous Actions
- Authors: Yuhan Li, Wenzhuo Zhou, Ruoqing Zhu
- Abstract summary: We develop a novel quasi-optimal learning algorithm, which can be easily optimized in off-policy settings.
We evaluate our algorithm with comprehensive simulated experiments and a real-data dose suggestion application to the Ohio Type 1 Diabetes dataset.
- Score: 8.17049210746654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many real-world applications of reinforcement learning (RL) require making
decisions in continuous action environments. In particular, determining the
optimal dose level plays a vital role in developing medical treatment regimes.
One challenge in adapting existing RL algorithms to medical applications,
however, is that the popular infinite support stochastic policies, e.g.,
Gaussian policy, may assign riskily high dosages and harm patients seriously.
Hence, it is important to induce a policy class whose support only contains
near-optimal actions, and shrink the action-searching area for effectiveness
and reliability. To achieve this, we develop a novel \emph{quasi-optimal
learning algorithm}, which can be easily optimized in off-policy settings with
guaranteed convergence under general function approximations. Theoretically, we
analyze the consistency, sample complexity, adaptability, and convergence of
the proposed algorithm. We evaluate our algorithm with comprehensive simulated
experiments and a real-data dose suggestion application to the Ohio Type 1
Diabetes dataset.
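To make the motivation concrete, here is a minimal, hypothetical sketch (not the paper's algorithm) contrasting an infinite-support Gaussian policy with a bounded-support policy for dose selection; the dose range, the Beta parameterization, and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dose_low, dose_high = 0.0, 10.0     # hypothetical admissible dose range

def gaussian_policy(mean=6.0, std=3.0, n=10_000):
    """Infinite-support policy: can propose doses outside the safe range."""
    return rng.normal(mean, std, size=n)

def bounded_policy(a=4.0, b=2.0, n=10_000):
    """Bounded-support policy: a Beta distribution rescaled to the dose range,
    so every sampled dose is admissible by construction."""
    return dose_low + (dose_high - dose_low) * rng.beta(a, b, size=n)

doses = gaussian_policy()
unsafe = np.mean((doses < dose_low) | (doses > dose_high))
print(f"Gaussian policy: {unsafe:.1%} of sampled doses fall outside the safe range")

safe_doses = bounded_policy()
print(f"Bounded policy:  {np.mean((safe_doses < dose_low) | (safe_doses > dose_high)):.1%}")
```

Restricting the support in this way is the motivation for the quasi-optimal policy class; the paper's actual construction additionally shrinks the support toward near-optimal actions.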
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned, but only their deterministic version is deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
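As background, a minimal sketch (not this paper's method) of the practice the abstract describes: learning with a stochastic Gaussian policy and deploying only its deterministic mean; the toy objective, step sizes, and the fixed exploration level `sigma` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def reward(a):                       # toy 1-D problem with optimum at a = 2.0
    return -(a - 2.0) ** 2

mu, sigma, lr = 0.0, 0.5, 0.05       # policy mean, exploration level, step size
baseline = 0.0                       # running reward baseline to reduce variance
for _ in range(3000):
    a = rng.normal(mu, sigma)                  # act with the stochastic policy
    r = reward(a)
    grad_log_pi = (a - mu) / sigma ** 2        # d/d(mu) log N(a; mu, sigma^2)
    mu += lr * (r - baseline) * grad_log_pi    # REINFORCE update on the mean
    baseline += 0.05 * (r - baseline)
print(f"deployed deterministic action: {mu:.2f}")   # approx. 2.0, up to Monte Carlo noise
```

The exploration level `sigma` is exactly the knob whose tuning the paper studies: larger values speed up learning but widen the gap between the stochastic policy that is trained and the deterministic policy that is deployed.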
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Offline Policy Optimization with Eligible Actions [34.4530766779594]
Offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation.
These estimators, however, can overfit to the logged data; we propose an algorithm that avoids this overfitting through a new per-state-neighborhood normalization constraint.
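For context, here is a minimal off-policy evaluation sketch with ordinary and self-normalized importance sampling on a toy bandit; the paper's per-state-neighborhood constraint is not reproduced, and the policies and reward model are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_actions = 5000, 4
behavior = np.full(n_actions, 0.25)             # logging (behavior) policy: uniform
target = np.array([0.1, 0.6, 0.2, 0.1])         # target policy to evaluate offline

actions = rng.choice(n_actions, size=n, p=behavior)
rewards = rng.normal(loc=0.5 * actions, scale=1.0)   # toy reward model

w = target[actions] / behavior[actions]         # importance weights
v_is = np.mean(w * rewards)                     # ordinary IS estimate
v_snis = np.sum(w * rewards) / np.sum(w)        # self-normalized IS (lower variance)
print(v_is, v_snis)                             # both estimate E_target[R] = 0.65
```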
arXiv Detail & Related papers (2022-07-01T19:18:15Z)
- Federated Offline Reinforcement Learning [55.326673977320574]
We propose a multi-site Markov decision process model that allows for both homogeneous and heterogeneous effects across sites.
We design the first federated policy optimization algorithm for offline RL with sample complexity guarantees.
We give a theoretical guarantee for the proposed algorithm, showing that the suboptimality of the learned policies is comparable to the rate achievable as if the data were not distributed.
arXiv Detail & Related papers (2022-06-11T18:03:26Z)
- A Prescriptive Dirichlet Power Allocation Policy with Deep Reinforcement Learning [6.003234406806134]
In this work, we propose the Dirichlet policy for continuous allocation tasks and analyze the bias and variance of its policy gradients.
We demonstrate that the Dirichlet policy is bias-free and provides significantly faster convergence and better performance than the Gaussian-softmax policy.
The experimental results show the potential to prescribe optimal operation and to improve the efficiency and sustainability of multi-power-source systems.
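A hedged sketch of the idea (not the authors' implementation): a Dirichlet policy maps unconstrained outputs to positive concentration parameters and samples allocation vectors that lie on the simplex by construction; the softplus mapping and the example logits are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def dirichlet_policy(logits):
    """Map unconstrained outputs to concentrations alpha > 0 via softplus,
    then sample an allocation vector over the power sources."""
    alpha = np.log1p(np.exp(logits)) + 1e-3     # softplus keeps alpha strictly positive
    allocation = rng.dirichlet(alpha)           # nonnegative and sums to 1
    return allocation, alpha

allocation, alpha = dirichlet_policy(np.array([0.2, 1.5, -0.3]))
print(allocation, allocation.sum())             # allocation is on the simplex, sums to 1.0
```

A Gaussian-softmax policy would instead sample an unconstrained Gaussian and squash it through a softmax, which is the baseline the abstract compares against.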
arXiv Detail & Related papers (2022-01-20T20:41:04Z)
- Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning [2.0625936401496237]
Recent advances in mobile health (mHealth) technology provide an effective way to monitor individuals' health statuses and deliver just-in-time personalized interventions.
The practical use of mHealth technology raises unique challenges to existing methodologies on learning an optimal dynamic treatment regime.
We propose a proximal temporal learning (pT-Learning) framework to estimate an optimal regime that is adaptively adjusted between deterministic and sparse policy models.
arXiv Detail & Related papers (2021-10-20T18:38:22Z)
- Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly to the global solution, in a dimension-free fashion, over an entire range of learning rates.
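To illustrate the flavor of such an update, here is a minimal sketch of an entropy-regularized policy mirror descent iteration on a single-state problem; the closed-form step below is the standard one for entropy regularization with a KL Bregman divergence, not GPMD's general-regularizer scheme, and the rewards, step size, and regularization weight are assumptions.

```python
import numpy as np

r = np.array([1.0, 0.5, 0.0])        # action rewards for a 3-armed, single-state problem
tau, eta = 0.1, 1.0                  # entropy-regularization weight, step size
pi = np.ones(3) / 3                  # start from the uniform policy

for _ in range(50):
    # PMD step with KL Bregman divergence and entropy regularizer:
    #   pi_{k+1}(a)  is proportional to  pi_k(a)^{1/(1+eta*tau)} * exp(eta * r(a) / (1+eta*tau))
    logits = (np.log(pi) + eta * r) / (1.0 + eta * tau)
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()

print(pi)                                         # converges linearly to ...
print(np.exp(r / tau) / np.sum(np.exp(r / tau)))  # ... the regularized optimum softmax(r / tau)
```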
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order algorithm (ZO-RL) that learns the sampling policy used to generate perturbations in ZO optimization, instead of relying on random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
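For background, a minimal sketch of the standard two-point zeroth-order gradient estimator with random Gaussian perturbations; ZO-RL's contribution, per the abstract, is to learn the perturbation distribution rather than sample it randomly, which is not reproduced here. The objective, smoothing radius, and step sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):                                    # black-box objective, no gradients available
    return np.sum((x - 1.0) ** 2)

def zo_gradient(f, x, mu=1e-2, n_samples=20):
    """Two-point ZO estimate: average of (f(x + mu*u) - f(x)) / mu * u."""
    grads = []
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)     # random direction; ZO-RL learns this instead
        grads.append((f(x + mu * u) - f(x)) / mu * u)
    return np.mean(grads, axis=0)

x = np.zeros(5)
for _ in range(200):
    x -= 0.05 * zo_gradient(f, x)
print(x)                                     # approaches the minimizer at all ones
```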
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Sample-Efficient Reinforcement Learning via Counterfactual-Based Data Augmentation [15.451690870640295]
In some scenarios, such as healthcare, usually only a few records are available for each patient, impeding the application of current reinforcement learning algorithms.
We propose a data-efficient RL algorithm that exploits structural causal models (SCMs) to model the state dynamics.
We show that counterfactual outcomes are identifiable under mild conditions and that Q-learning on the counterfactual-based augmented data set converges to the optimal value function.
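A hedged illustration of the counterfactual step (not the paper's estimator), assuming a known additive-noise linear SCM for the state dynamics: abduction recovers the exogenous noise from an observed transition, and the same noise is reused to impute the next state under an alternative action. The dynamics coefficients and transition values are made up.

```python
# Assumed SCM: s_next = a_coef * s + b_coef * action + noise
a_coef, b_coef = 0.9, 0.5            # hypothetical known linear dynamics
s, action, s_next = 1.0, 0.2, 1.37   # one observed transition

noise = s_next - (a_coef * s + b_coef * action)       # abduction: recover exogenous noise
alt_action = 0.8                                      # counterfactual action
s_next_cf = a_coef * s + b_coef * alt_action + noise  # imputed counterfactual next state
print(s_next_cf)                                      # 1.67: an augmented transition for Q-learning
```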
arXiv Detail & Related papers (2020-12-16T17:21:13Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of the gradient estimate.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
- A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes.
We build on the general sample efficiency of off-policy algorithms.
We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)