Rule-Guided Reinforcement Learning Policy Evaluation and Improvement
- URL: http://arxiv.org/abs/2503.09270v1
- Date: Wed, 12 Mar 2025 11:13:08 GMT
- Title: Rule-Guided Reinforcement Learning Policy Evaluation and Improvement
- Authors: Martin Tappler, Ignacio D. Lopez-Miguel, Sebastian Tschiatschek, Ezio Bartocci
- Abstract summary: LEGIBLE is a novel approach to improving deep reinforcement learning policies. It starts by mining rules from a deep RL policy, constituting a partially symbolic representation. In the second step, we generalize the mined rules using domain knowledge expressed as metamorphic relations. The third step is evaluating generalized rules to determine which generalizations improve performance when enforced.
- Score: 9.077163856137505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the challenging problem of using domain knowledge to improve deep reinforcement learning policies. To this end, we propose LEGIBLE, a novel approach, following a multi-step process, which starts by mining rules from a deep RL policy, constituting a partially symbolic representation. These rules describe which decisions the RL policy makes and which it avoids making. In the second step, we generalize the mined rules using domain knowledge expressed as metamorphic relations. We adapt these relations from software testing to RL to specify expected changes of actions in response to changes in observations. The third step is evaluating generalized rules to determine which generalizations improve performance when enforced. These improvements show weaknesses in the policy, where it has not learned the general rules and thus can be improved by rule guidance. LEGIBLE supported by metamorphic relations provides a principled way of expressing and enforcing domain knowledge about RL environments. We show the efficacy of our approach by demonstrating that it effectively finds weaknesses, accompanied by explanations of these weaknesses, in eleven RL environments and by showcasing that guiding policy execution with rules improves performance w.r.t. gained reward.
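The abstract describes a three-step pipeline: mine rules from the trained policy, generalize them with metamorphic relations, and evaluate the generalized rules by enforcing them during execution. The following is a minimal Python sketch of how such a pipeline could be wired together; the names (Rule, mine_rules, generalize, evaluate), the sign-pattern predicate, and the Gymnasium-style environment interface are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the three LEGIBLE steps described in the abstract.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Rule:
    """If `condition(obs)` holds, the guided policy should take `action`."""
    condition: Callable[[np.ndarray], bool]
    action: int


def mine_rules(policy, observations) -> List[Rule]:
    """Step 1: extract a partially symbolic description of the policy, here by
    recording which action it picks for a toy observation predicate
    (the sign pattern of the observation)."""
    rules = []
    for obs in observations:
        action = int(policy(obs))
        signs = tuple(np.sign(obs))
        rules.append(Rule(lambda o, s=signs: tuple(np.sign(o)) == s, action))
    return rules


def generalize(rules: List[Rule], transform_obs, transform_action) -> List[Rule]:
    """Step 2: apply a metamorphic relation, i.e. a pair of transformations
    stating how the action should change when the observation changes
    (e.g. mirroring the observation should mirror the action)."""
    generalized = [
        Rule(lambda o, r=rule: r.condition(transform_obs(o)),
             transform_action(rule.action))
        for rule in rules
    ]
    return rules + generalized


def evaluate(env, policy, rules: List[Rule], episodes: int = 20) -> float:
    """Step 3: enforce the generalized rules during rollouts and measure the
    gained reward; an improvement over the unguided policy points to a
    weakness of the learned policy."""
    total = 0.0
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            action = int(policy(obs))
            for rule in rules:  # rule guidance: matching rules override the policy
                if rule.condition(obs):
                    action = rule.action
                    break
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
    return total / episodes
```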
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Counterfactual Explanation Policies in RL [3.674863913115432]
COUNTERPOL is the first framework to analyze Reinforcement Learning policies using counterfactual explanations.
We establish a theoretical connection between Counterpol and widely used trust region-based policy optimization methods in RL.
arXiv Detail & Related papers (2023-07-25T01:14:56Z) - Offline Reinforcement Learning with Closed-Form Policy Improvement
Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z) - Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce three variants of MISA and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z) - Towards an Understanding of Default Policies in Multitask Policy
Optimization [29.806071693039655]
Much of the recent success of deep reinforcement learning has been driven by regularized policy optimization (RPO) algorithms.
We take a first step towards filling this gap by formally linking the quality of the default policy to its effect on optimization.
We then derive a principled RPO algorithm for multitask learning with strong performance guarantees.
arXiv Detail & Related papers (2021-11-04T16:45:15Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z) - Invariant Policy Optimization: Towards Stronger Generalization in
Reinforcement Learning [5.476958867922322]
A fundamental challenge in reinforcement learning is to learn policies that generalize beyond the operating domains experienced during training.
We propose a novel learning algorithm, Invariant Policy Optimization (IPO), that implements this principle and learns an invariant policy during training.
arXiv Detail & Related papers (2020-06-01T17:28:19Z) - Reinforcement Learning [36.664136621546575]
Reinforcement learning (RL) is a general framework for adaptive control, which has proven to be efficient in many domains.
In this chapter, we present the basic framework of RL and recall the two main families of approaches that have been developed to learn a good policy.
arXiv Detail & Related papers (2020-05-29T06:53:29Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance (see the sketch after this list).
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
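Among the related papers above, the BRPO summary is concrete enough to illustrate. Below is a minimal sketch of a residual policy with a state-action-dependent allowable deviation; the per-action epsilon, the mixing rule, and the function name are assumptions made for illustration, not the paper's actual construction.

```python
# Hypothetical sketch of a residual policy in the spirit of the BRPO summary.
import numpy as np


def residual_policy_probs(behavior_probs: np.ndarray,
                          residual_logits: np.ndarray,
                          epsilon: np.ndarray) -> np.ndarray:
    """Mix the behavior (data-generating) policy with a learned residual;
    epsilon[a] in [0, 1] plays the role of the learned, state-action-dependent
    allowable deviation, passed in per action for the current state."""
    residual = np.exp(residual_logits - residual_logits.max())
    residual /= residual.sum()
    mixed = (1.0 - epsilon) * behavior_probs + epsilon * residual
    return mixed / mixed.sum()


# With epsilon near 0 the mixed policy stays close to the behavior policy;
# larger epsilon permits larger deviations where they help.
probs = residual_policy_probs(np.array([0.7, 0.2, 0.1]),
                              np.array([0.0, 2.0, 0.0]),
                              np.array([0.1, 0.3, 0.1]))
```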