Ranking Policy Decisions
- URL: http://arxiv.org/abs/2008.13607v3
- Date: Tue, 26 Oct 2021 17:28:22 GMT
- Title: Ranking Policy Decisions
- Authors: Hadrien Pouget, Hana Chockler, Youcheng Sun, Daniel Kroening
- Abstract summary: Policies trained via Reinforcement Learning (RL) are often needlessly complex, making them difficult to analyse and interpret.
We propose a novel black-box method based on statistical fault localisation that ranks the states of the environment according to the importance of decisions made in those states.
- Score: 14.562620527204686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policies trained via Reinforcement Learning (RL) are often needlessly
complex, making them difficult to analyse and interpret. In a run with $n$ time
steps, a policy will make $n$ decisions on actions to take; we conjecture that
only a small subset of these decisions delivers value over selecting a simple
default action. Given a trained policy, we propose a novel black-box method
based on statistical fault localisation that ranks the states of the
environment according to the importance of decisions made in those states. We
argue that among other things, the ranked list of states can help explain and
understand the policy. As the ranking method is statistical, a direct
evaluation of its quality is hard. As a proxy for quality, we use the ranking
to create new, simpler policies from the original ones by pruning decisions
identified as unimportant (that is, replacing them by default actions) and
measuring the impact on performance. Our experiments on a diverse set of
standard benchmarks demonstrate that pruned policies can perform on a level
comparable to the original policies. Conversely, we show that naive approaches
for ranking policy decisions, e.g., ranking based on the frequency of visiting
a state, do not result in high-performing pruned policies.
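As a rough illustration of the pruning experiment used as a proxy for ranking quality, the sketch below (not the authors' implementation) ranks states with a spectrum-based, Ochiai-style fault-localisation score and then builds a pruned policy that falls back to a default action outside the top-ranked states. It assumes a Gymnasium-style environment; `env`, `policy`, `default_action`, and the failure `threshold` are placeholders.

```python
import random
from collections import defaultdict
import numpy as np

def _key(obs):
    """Hashable key for an observation (works for scalar or array observations)."""
    return tuple(np.ravel(obs).tolist())

def episode(env, act_fn, max_steps=500):
    """Run one episode; return the set of visited state keys and the return."""
    obs, _ = env.reset()
    visited, ret = set(), 0.0
    for _ in range(max_steps):
        visited.add(_key(obs))
        obs, r, terminated, truncated, _ = env.step(act_fn(obs))
        ret += r
        if terminated or truncated:
            break
    return visited, ret

def rank_states(env, policy, default_action, n_runs=200, p_mutate=0.5, threshold=0.0):
    """Spectrum-style ranking: each run replaces the policy's action with the
    default action in a random subset of states; states whose mutation
    co-occurs with failed episodes (return <= threshold, task-specific) get a
    high Ochiai-like suspiciousness score."""
    counts = defaultdict(lambda: [0, 0, 0, 0])  # state -> [ef, ep, nf, np]
    for _ in range(n_runs):
        mutated = {}  # per-run decision: state key -> use default action there?
        def act(obs):
            k = _key(obs)
            if k not in mutated:
                mutated[k] = random.random() < p_mutate
            return default_action if mutated[k] else policy(obs)
        visited, ret = episode(env, act)
        fail = ret <= threshold
        for s in visited:
            idx = (0 if fail else 1) if mutated.get(s) else (2 if fail else 3)
            counts[s][idx] += 1
    def ochiai(c):
        ef, ep, nf, _ = c
        denom = ((ef + nf) * (ef + ep)) ** 0.5
        return ef / denom if denom else 0.0
    return sorted(counts, key=lambda s: ochiai(counts[s]), reverse=True)

def pruned_policy(policy, default_action, important):
    """Keep the original decision only in the top-ranked states."""
    return lambda obs: policy(obs) if _key(obs) in important else default_action
```

With these pieces, `ranking = rank_states(env, policy, default_action)` followed by `pruned_policy(policy, default_action, set(ranking[:k]))` mimics, in spirit, the evaluation of pruned policies described above.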
Related papers
- Clustered Policy Decision Ranking [6.338178373376447]
In an episode with n time steps, a policy will make n decisions on actions to take, many of which may appear non-intuitive to the observer.
It is not clear which of these decisions directly contribute towards achieving the reward and how significant their contribution is.
We propose a black-box method based on statistical covariance estimation that clusters the states of the environment and ranks each cluster according to the importance of decisions made in its states.
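A hedged reading of this idea, with placeholder names rather than the authors' code: cluster the visited states, randomly toggle each cluster between the trained policy and a default action over many episodes, and rank clusters by how strongly their toggle covaries with the episode return. A Gymnasium-style environment with vector observations is assumed.

```python
import numpy as np
from sklearn.cluster import KMeans

def rank_clusters(env, policy, default_action, n_clusters=8, n_runs=200, max_steps=500):
    """Cluster visited states, then score each cluster by how strongly
    'this cluster kept the trained policy' covaries with the episode return."""
    feat = lambda obs: np.ravel(np.asarray(obs, dtype=float))

    # 1) Collect states from a few on-policy episodes and cluster them.
    states = []
    for _ in range(10):
        obs, _ = env.reset()
        for _ in range(max_steps):
            states.append(feat(obs))
            obs, _, term, trunc, _ = env.step(policy(obs))
            if term or trunc:
                break
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(np.stack(states))

    # 2) Randomly toggle whole clusters between the policy and the default action.
    toggles, returns = [], []
    for _ in range(n_runs):
        keep = np.random.rand(n_clusters) < 0.5   # True: cluster keeps the policy
        obs, _ = env.reset()
        ret = 0.0
        for _ in range(max_steps):
            c = km.predict(feat(obs).reshape(1, -1))[0]
            obs, r, term, trunc, _ = env.step(policy(obs) if keep[c] else default_action)
            ret += r
            if term or trunc:
                break
        toggles.append(keep.astype(float))
        returns.append(ret)

    # 3) Importance of a cluster = covariance between its toggle and the return.
    T, R = np.stack(toggles), np.asarray(returns)
    cov = ((T - T.mean(0)) * (R - R.mean())[:, None]).mean(0)
    return np.argsort(-cov)  # cluster indices, most important first
```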
arXiv Detail & Related papers (2023-11-21T20:16:02Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
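As a simplified stand-in for this construction (not the paper's exact estimator), the sketch below forms an interval from behaviour-policy returns tilted by trajectory importance weights; `returns` and `importance_weights` are assumed to come from logged episodes.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """q-quantile of a discrete distribution putting mass `weights` on `values`."""
    order = np.argsort(values)
    v, w = np.asarray(values, float)[order], np.asarray(weights, float)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[min(np.searchsorted(cdf, q), len(v) - 1)]

def conformal_value_interval(returns, importance_weights, alpha=0.1):
    """Interval intended to cover the target policy's per-episode return with
    probability about 1 - alpha: behaviour-policy returns are reweighted by the
    trajectory ratios pi_target / pi_behaviour (a simplified stand-in for the
    weighted conformal construction, not the paper's estimator)."""
    lo = weighted_quantile(returns, importance_weights, alpha / 2)
    hi = weighted_quantile(returns, importance_weights, 1 - alpha / 2)
    return lo, hi
```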
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States [12.059140532198064]
Learning to evaluate and improve policies is a core problem of Reinforcement Learning.
A recently explored competitive alternative is to learn a single value function for many policies.
We show that our value function trained to evaluate NN policies is also invariant to changes of the policy architecture.
arXiv Detail & Related papers (2022-07-04T16:34:53Z)
- Causal policy ranking [3.7819322027528113]
Given a trained policy, we propose a black-box method based on counterfactual reasoning that estimates the causal effect that these decisions have on reward attainment.
In this work, we compare our measure against an alternative, non-causal, ranking procedure, and discuss potential future work integrating causal algorithms into the interpretation of RL agent policies.
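A minimal illustration of the counterfactual comparison, with placeholder names and not the paper's estimator: the effect of the decision at a state is estimated by contrasting rollouts where the policy acts in that state with rollouts where a do(default_action) intervention is applied there.

```python
import numpy as np

def decision_effect(env, policy, default_action, target_key, key_fn,
                    n_rollouts=50, max_steps=500):
    """Monte-Carlo estimate of the effect of the decision taken at the state
    identified by `target_key`: mean return when the policy acts there minus
    mean return under a do(default_action) intervention at that state."""
    def run(intervene):
        obs, _ = env.reset()
        ret = 0.0
        for _ in range(max_steps):
            use_default = intervene and key_fn(obs) == target_key
            obs, r, term, trunc, _ = env.step(default_action if use_default else policy(obs))
            ret += r
            if term or trunc:
                break
        return ret
    factual = np.mean([run(False) for _ in range(n_rollouts)])
    intervened = np.mean([run(True) for _ in range(n_rollouts)])
    return factual - intervened   # larger = the decision contributes more reward

# Ranking: sort candidate states by decision_effect, largest first.
```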
arXiv Detail & Related papers (2021-11-16T12:33:36Z)
- Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment [0.0]
We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy.
We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations.
We derive new classification and recommendation rules that retain the transparency and interpretability of the existing risk assessment instrument.
arXiv Detail & Related papers (2021-09-22T00:52:03Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
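A minimal, hedged sketch of the training signal (the policy features and network below are placeholders, not the paper's model): fit a scoring function with a pairwise logistic ranking loss so that training policies with higher known performance receive higher scores.

```python
import torch
import torch.nn as nn

class PolicyScorer(nn.Module):
    """Maps a fixed-size feature vector describing a policy to a scalar score."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_ranker(features, performance, epochs=200, lr=1e-3):
    """Fit the scorer with a pairwise logistic ranking loss: for every pair of
    training policies, the better-performing one should get the higher score."""
    x = torch.as_tensor(features, dtype=torch.float32)     # (n_policies, feat_dim)
    y = torch.as_tensor(performance, dtype=torch.float32)  # (n_policies,)
    model = PolicyScorer(x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    better = y[:, None] > y[None, :]                       # pair (i, j): i beats j
    i_idx, j_idx = torch.nonzero(better, as_tuple=True)
    for _ in range(epochs):
        s = model(x)
        loss = torch.nn.functional.softplus(-(s[i_idx] - s[j_idx])).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model  # rank new policies by model(new_features), highest first
```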
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Offline Policy Selection under Uncertainty [113.57441913299868]
We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full belief distribution over the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how BayesDICE may be used to rank policies with respect to any arbitrary downstream policy selection metric.
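As a generic illustration of selecting with a belief distribution rather than a point estimate (placeholder names; this is not BayesDICE itself), posterior samples of each candidate policy's value let any downstream metric, such as the probability of being the best policy, be computed directly and used for ranking.

```python
import numpy as np

def rank_by_prob_best(value_samples):
    """value_samples: array of shape (n_samples, n_policies), each row a joint
    draw from the belief over policy values. Rank policies by the posterior
    probability that they attain the highest value."""
    best = np.argmax(value_samples, axis=1)                 # winner in each draw
    prob_best = np.bincount(best, minlength=value_samples.shape[1]) / len(best)
    return np.argsort(-prob_best), prob_best

# Synthetic belief for illustration: policy 1 has a higher but riskier value.
samples = np.random.default_rng(0).normal(
    loc=[1.0, 1.2, 0.8], scale=[0.1, 0.5, 0.1], size=(10000, 3))
order, p = rank_by_prob_best(samples)
```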
arXiv Detail & Related papers (2020-12-12T23:09:21Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation, jointly maximizing a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
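A rough sketch of the residual-policy idea under stated assumptions (placeholder names; not the paper's algorithm): the executed policy mixes the behaviour policy with a learned correction, and a state-dependent weight bounds how far it may deviate.

```python
import numpy as np

def residual_action_probs(behavior_probs, learned_probs, allowed_deviation):
    """Mix behaviour and learned action distributions for one state.
    `allowed_deviation` in [0, 1] is the state-dependent weight on the learned
    policy; 0 recovers the behaviour policy exactly."""
    probs = (1.0 - allowed_deviation) * np.asarray(behavior_probs) \
            + allowed_deviation * np.asarray(learned_probs)
    return probs / probs.sum()

# Example: in this state the learned policy is only allowed a 20% deviation.
print(residual_action_probs([0.7, 0.2, 0.1], [0.1, 0.1, 0.8], 0.2))
```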
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.