Generative Modelling of Stochastic Actions with Arbitrary Constraints in
Reinforcement Learning
- URL: http://arxiv.org/abs/2311.15341v1
- Date: Sun, 26 Nov 2023 15:57:20 GMT
- Title: Generative Modelling of Stochastic Actions with Arbitrary Constraints in
Reinforcement Learning
- Authors: Changyu Chen, Ramesha Karunasena, Thanh Hong Nguyen, Arunesh Sinha,
Pradeep Varakantham
- Abstract summary: Many problems in Reinforcement Learning (RL) seek an optimal policy with large discrete multidimensional yet unordered action spaces.
A challenge in this setting is that the underlying action space is categorical (discrete and unordered) and large.
In this work, we address these challenges by applying a (state) conditional normalizing flow to compactly represent the policy.
- Score: 25.342811509665097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many problems in Reinforcement Learning (RL) seek an optimal policy with
large discrete multidimensional yet unordered action spaces; these include
problems in randomized allocation of resources such as placements of multiple
security resources and emergency response units, etc. A challenge in this
setting is that the underlying action space is categorical (discrete and
unordered) and large, for which existing RL methods do not perform well.
Moreover, these problems require validity of the realized action (allocation);
this validity constraint is often difficult to express compactly in a closed
mathematical form. The allocation nature of the problem also prefers stochastic
optimal policies, if one exists. In this work, we address these challenges by
(1) applying a (state) conditional normalizing flow to compactly represent the
stochastic policy -- the compactness arises due to the network only producing
one sampled action and the corresponding log probability of the action, which
is then used by an actor-critic method; and (2) employing an invalid action
rejection method (via a valid action oracle) to update the base policy. The
action rejection is enabled by a modified policy gradient that we derive.
Finally, we conduct extensive experiments to show the scalability of our
approach compared to prior methods and the ability to enforce arbitrary
state-conditional constraints on the support of the distribution of actions in
any state.
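The two ingredients above (sampling from a base policy, then rejecting invalid actions via a validity oracle) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_policy` stands in for the conditional normalizing flow, and `is_valid` is a hypothetical oracle encoding a toy budget constraint.

```python
import random

def is_valid(action):
    """Hypothetical validity oracle: a toy budget constraint requiring
    the allocation over 3 sites to use at most 5 units in total."""
    return sum(action) <= 5

def sample_action(policy_sampler, oracle, max_tries=100):
    """Rejection step: draw actions from the base policy until the
    oracle accepts one. The number of rejections is returned because
    the paper's modified policy gradient accounts for rejected mass."""
    for tries in range(max_tries):
        action = policy_sampler()
        if oracle(action):
            return action, tries
    raise RuntimeError("no valid action found within max_tries")

def toy_policy():
    """Stand-in for a (state-)conditional flow: samples a categorical,
    multidimensional allocation (3 sites, 0-4 units each)."""
    return [random.randrange(5) for _ in range(3)]

random.seed(0)
action, rejections = sample_action(toy_policy, is_valid)
print(action, rejections)
```

In the actual method the sampler would also return the action's log probability for the actor-critic update; here only the rejection mechanism is shown.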
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
arXiv Detail & Related papers (2024-02-23T23:22:06Z)
- FlowPG: Action-constrained Policy Gradient with Normalizing Flows [14.98383953401637]
Action-constrained reinforcement learning (ACRL) is a popular approach for solving safety-critical, resource-allocation-related decision-making problems.
A major challenge in ACRL is to ensure that the agent takes a valid action satisfying the constraints at each step.
arXiv Detail & Related papers (2024-02-07T11:11:46Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)
- A Prescriptive Dirichlet Power Allocation Policy with Deep Reinforcement Learning [6.003234406806134]
In this work, we propose the Dirichlet policy for continuous allocation tasks and analyze the bias and variance of its policy gradients.
We demonstrate that the Dirichlet policy is bias-free and provides significantly faster convergence and better performance than the Gaussian-softmax policy.
The experimental results show the potential to prescribe optimal operation, improve the efficiency and sustainability of multi-power source systems.
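Sampling allocation fractions from a Dirichlet policy can be sketched with the standard Gamma-normalization construction. This is a minimal illustration of the distribution only, not the paper's training method; the concentration parameters below are illustrative, and in the actual policy they would be produced by a network conditioned on the state.

```python
import random

def sample_dirichlet(alphas, rng=random):
    """Sample allocation fractions from Dirichlet(alphas) via the
    standard construction: draw independent Gamma(alpha_i, 1)
    variates and normalize so the fractions sum to 1."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

random.seed(0)
# Illustrative concentration parameters for a 3-source power allocation.
fractions = sample_dirichlet([2.0, 1.0, 1.0])
print(fractions)  # three non-negative fractions summing to 1
```

Because the samples live on the simplex by construction, no squashing or softmax is needed, which is one reason such a policy can avoid the bias the paper attributes to the Gaussian-softmax alternative.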
arXiv Detail & Related papers (2022-01-20T20:41:04Z)
- Reinforcement Learning With Sparse-Executing Actions via Sparsity Regularization [15.945378631406024]
Reinforcement learning (RL) has demonstrated impressive performance in decision-making tasks like embodied control, autonomous driving and financial trading.
In many decision-making tasks, the agents often encounter the problem of executing actions under limited budgets.
This paper formalizes the problem as a Sparse Action Markov Decision Process (SA-MDP), in which specific actions in the action space can only be executed for a limited time.
We propose a policy optimization algorithm, Action Sparsity REgularization (ASRE), which adaptively handles each action with a distinct preference.
arXiv Detail & Related papers (2021-05-18T16:50:42Z)
- Addressing Action Oscillations through Learning Policy Inertia [26.171039226334504]
Policy Inertia Controller (PIC) serves as a generic plug-in framework to off-the-shelf DRL algorithms.
We propose Nested Policy Iteration as a general training algorithm for PIC-augmented policy.
We derive a practical DRL algorithm, namely Nested Soft Actor-Critic.
arXiv Detail & Related papers (2021-03-03T09:59:43Z)
- State Augmented Constrained Reinforcement Learning: Overcoming the Limitations of Learning with Rewards [88.30521204048551]
A common formulation of constrained reinforcement learning involves multiple rewards that must individually accumulate to given thresholds.
We show a simple example in which the desired optimal policy cannot be induced by any weighted linear combination of rewards.
This work addresses this shortcoming by augmenting the state with Lagrange multipliers and reinterpreting primal-dual methods.
arXiv Detail & Related papers (2021-02-23T21:07:35Z)
- Deep Constrained Q-learning [15.582910645906145]
In many real world applications, reinforcement learning agents have to optimize multiple objectives while following certain rules or satisfying a set of constraints.
We propose Constrained Q-learning, a novel off-policy reinforcement learning framework restricting the action space directly in the Q-update to learn the optimal Q-function for the induced constrained MDP and the corresponding safe policy.
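The idea of restricting the action space directly in the Q-update can be sketched in tabular form. This is a minimal illustration only (the paper's method is a deep, off-policy variant), and `valid_actions` is a hypothetical constraint oracle introduced for the example.

```python
from collections import defaultdict

def constrained_q_update(Q, s, a, r, s_next, valid_actions,
                         alpha=0.1, gamma=0.99):
    """One tabular Q-update in the spirit of Constrained Q-learning:
    the bootstrap maximum ranges only over actions allowed in s_next,
    so the learned Q-function respects the induced constrained MDP."""
    allowed = valid_actions(s_next)
    target = r + gamma * max(Q[(s_next, b)] for b in allowed)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

# Toy usage: action 1 is forbidden in state "s1", so the bootstrap
# target ignores its (high) Q-value.
Q = defaultdict(float)
Q[("s1", 1)] = 10.0  # attractive but invalid in s1
Q = constrained_q_update(Q, "s0", 0, 1.0, "s1",
                         valid_actions=lambda s: [0, 2])
print(Q[("s0", 0)])  # → 0.1
```

Filtering inside the update, rather than only at execution time, keeps invalid actions from inflating the bootstrap target and propagating optimistic values backward.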
arXiv Detail & Related papers (2020-03-20T17:26:03Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.