Ranking Policy Decisions
- URL: http://arxiv.org/abs/2008.13607v3
- Date: Tue, 26 Oct 2021 17:28:22 GMT
- Title: Ranking Policy Decisions
- Authors: Hadrien Pouget, Hana Chockler, Youcheng Sun, Daniel Kroening
- Abstract summary: Policies trained via Reinforcement Learning (RL) are often needlessly complex, making them difficult to analyse and interpret.
We propose a novel black-box method based on statistical fault localisation that ranks the states of the environment according to the importance of decisions made in those states.
- Score: 14.562620527204686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Policies trained via Reinforcement Learning (RL) are often needlessly
complex, making them difficult to analyse and interpret. In a run with $n$ time
steps, a policy will make $n$ decisions on actions to take; we conjecture that
only a small subset of these decisions delivers value over selecting a simple
default action. Given a trained policy, we propose a novel black-box method
based on statistical fault localisation that ranks the states of the
environment according to the importance of decisions made in those states. We
argue that among other things, the ranked list of states can help explain and
understand the policy. As the ranking method is statistical, a direct
evaluation of its quality is hard. As a proxy for quality, we use the ranking
to create new, simpler policies from the original ones by pruning decisions
identified as unimportant (that is, replacing them by default actions) and
measuring the impact on performance. Our experiments on a diverse set of
standard benchmarks demonstrate that pruned policies can perform on a level
comparable to the original policies. Conversely, we show that naive approaches
for ranking policy decisions, e.g., ranking based on the frequency of visiting
a state, do not result in high-performing pruned policies.
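As a rough illustration of the pruning experiment used as a proxy for ranking quality, the sketch below (not the authors' implementation) ranks states with a spectrum-based, Ochiai-style fault-localisation score and then builds a pruned policy that falls back to a default action outside the top-ranked states. It assumes a Gymnasium-style environment; `env`, `policy`, `default_action`, and the failure `threshold` are placeholders.

```python
import random
from collections import defaultdict
import numpy as np

def _key(obs):
    """Hashable key for an observation (works for scalar or array observations)."""
    return tuple(np.ravel(obs).tolist())

def episode(env, act_fn, max_steps=500):
    """Run one episode; return the set of visited state keys and the return."""
    obs, _ = env.reset()
    visited, ret = set(), 0.0
    for _ in range(max_steps):
        visited.add(_key(obs))
        obs, r, terminated, truncated, _ = env.step(act_fn(obs))
        ret += r
        if terminated or truncated:
            break
    return visited, ret

def rank_states(env, policy, default_action, n_runs=200, p_mutate=0.5, threshold=0.0):
    """Spectrum-style ranking: each run replaces the policy's action with the
    default action in a random subset of states; states whose mutation
    co-occurs with failed episodes (return <= threshold, task-specific) get a
    high Ochiai-like suspiciousness score."""
    counts = defaultdict(lambda: [0, 0, 0, 0])  # state -> [ef, ep, nf, np]
    for _ in range(n_runs):
        mutated = {}  # per-run decision: state key -> use default action there?
        def act(obs):
            k = _key(obs)
            if k not in mutated:
                mutated[k] = random.random() < p_mutate
            return default_action if mutated[k] else policy(obs)
        visited, ret = episode(env, act)
        fail = ret <= threshold
        for s in visited:
            idx = (0 if fail else 1) if mutated.get(s) else (2 if fail else 3)
            counts[s][idx] += 1
    def ochiai(c):
        ef, ep, nf, _ = c
        denom = ((ef + nf) * (ef + ep)) ** 0.5
        return ef / denom if denom else 0.0
    return sorted(counts, key=lambda s: ochiai(counts[s]), reverse=True)

def pruned_policy(policy, default_action, important):
    """Keep the original decision only in the top-ranked states."""
    return lambda obs: policy(obs) if _key(obs) in important else default_action
```

With these pieces, `ranking = rank_states(env, policy, default_action)` followed by `pruned_policy(policy, default_action, set(ranking[:k]))` mimics, in spirit, the evaluation of pruned policies described above.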
Related papers
- Clustered Policy Decision Ranking [6.338178373376447]
In an episode with n time steps, a policy will make n decisions on actions to take, many of which may appear non-intuitive to the observer.
It is not clear which of these decisions directly contribute towards achieving the reward and how significant their contribution is.
We propose a black-box method based on statistical covariance estimation that clusters the states of the environment and ranks each cluster according to the importance of decisions made in its states.
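A hedged reading of this idea, with placeholder names rather than the authors' code: cluster the visited states, randomly toggle each cluster between the trained policy and a default action over many episodes, and rank clusters by how strongly their toggle covaries with the episode return. A Gymnasium-style environment with vector observations is assumed.

```python
import numpy as np
from sklearn.cluster import KMeans

def rank_clusters(env, policy, default_action, n_clusters=8, n_runs=200, max_steps=500):
    """Cluster visited states, then score each cluster by how strongly
    'this cluster kept the trained policy' covaries with the episode return."""
    feat = lambda obs: np.ravel(np.asarray(obs, dtype=float))

    # 1) Collect states from a few on-policy episodes and cluster them.
    states = []
    for _ in range(10):
        obs, _ = env.reset()
        for _ in range(max_steps):
            states.append(feat(obs))
            obs, _, term, trunc, _ = env.step(policy(obs))
            if term or trunc:
                break
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(np.stack(states))

    # 2) Randomly toggle whole clusters between the policy and the default action.
    toggles, returns = [], []
    for _ in range(n_runs):
        keep = np.random.rand(n_clusters) < 0.5   # True: cluster keeps the policy
        obs, _ = env.reset()
        ret = 0.0
        for _ in range(max_steps):
            c = km.predict(feat(obs).reshape(1, -1))[0]
            obs, r, term, trunc, _ = env.step(policy(obs) if keep[c] else default_action)
            ret += r
            if term or trunc:
                break
        toggles.append(keep.astype(float))
        returns.append(ret)

    # 3) Importance of a cluster = covariance between its toggle and the return.
    T, R = np.stack(toggles), np.asarray(returns)
    cov = ((T - T.mean(0)) * (R - R.mean())[:, None]).mean(0)
    return np.argsort(-cov)  # cluster indices, most important first
```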
arXiv Detail & Related papers (2023-11-21T20:16:02Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
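As a simplified stand-in for this construction (not the paper's exact estimator), the sketch below forms an interval from behaviour-policy returns tilted by trajectory importance weights; `returns` and `importance_weights` are assumed to come from logged episodes.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """q-quantile of a discrete distribution putting mass `weights` on `values`."""
    order = np.argsort(values)
    v, w = np.asarray(values, float)[order], np.asarray(weights, float)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return v[min(np.searchsorted(cdf, q), len(v) - 1)]

def conformal_value_interval(returns, importance_weights, alpha=0.1):
    """Interval intended to cover the target policy's per-episode return with
    probability about 1 - alpha: behaviour-policy returns are reweighted by the
    trajectory ratios pi_target / pi_behaviour (a simplified stand-in for the
    weighted conformal construction, not the paper's estimator)."""
    lo = weighted_quantile(returns, importance_weights, alpha / 2)
    hi = weighted_quantile(returns, importance_weights, 1 - alpha / 2)
    return lo, hi
```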
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States [12.059140532198064]
Learning to evaluate and improve policies is a core problem of Reinforcement Learning.
A recently explored competitive alternative is to learn a single value function for many policies.
We show that our value function trained to evaluate NN policies is also invariant to changes of the policy architecture.
arXiv Detail & Related papers (2022-07-04T16:34:53Z)
- Causal policy ranking [3.7819322027528113]
Given a trained policy, we propose a black-box method based on counterfactual reasoning that estimates the causal effect that these decisions have on reward attainment.
In this work, we compare our measure against an alternative, non-causal, ranking procedure, and discuss potential future work integrating causal algorithms into the interpretation of RL agent policies.
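A minimal illustration of the counterfactual comparison, with placeholder names and not the paper's estimator: the effect of the decision at a state is estimated by contrasting rollouts where the policy acts in that state with rollouts where a do(default_action) intervention is applied there.

```python
import numpy as np

def decision_effect(env, policy, default_action, target_key, key_fn,
                    n_rollouts=50, max_steps=500):
    """Monte-Carlo estimate of the effect of the decision taken at the state
    identified by `target_key`: mean return when the policy acts there minus
    mean return under a do(default_action) intervention at that state."""
    def run(intervene):
        obs, _ = env.reset()
        ret = 0.0
        for _ in range(max_steps):
            use_default = intervene and key_fn(obs) == target_key
            obs, r, term, trunc, _ = env.step(default_action if use_default else policy(obs))
            ret += r
            if term or trunc:
                break
        return ret
    factual = np.mean([run(False) for _ in range(n_rollouts)])
    intervened = np.mean([run(True) for _ in range(n_rollouts)])
    return factual - intervened   # larger = the decision contributes more reward

# Ranking: sort candidate states by decision_effect, largest first.
```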
arXiv Detail & Related papers (2021-11-16T12:33:36Z)
- Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment [0.0]
We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy.
We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations.
We derive new classification and recommendation rules that retain the transparency and interpretability of the existing risk assessment instrument.
arXiv Detail & Related papers (2021-09-22T00:52:03Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
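A minimal, hedged sketch of the training signal (the policy features and network below are placeholders, not the paper's model): fit a scoring function with a pairwise logistic ranking loss so that training policies with higher known performance receive higher scores.

```python
import torch
import torch.nn as nn

class PolicyScorer(nn.Module):
    """Maps a fixed-size feature vector describing a policy to a scalar score."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_ranker(features, performance, epochs=200, lr=1e-3):
    """Fit the scorer with a pairwise logistic ranking loss: for every pair of
    training policies, the better-performing one should get the higher score."""
    x = torch.as_tensor(features, dtype=torch.float32)     # (n_policies, feat_dim)
    y = torch.as_tensor(performance, dtype=torch.float32)  # (n_policies,)
    model = PolicyScorer(x.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    better = y[:, None] > y[None, :]                       # pair (i, j): i beats j
    i_idx, j_idx = torch.nonzero(better, as_tuple=True)
    for _ in range(epochs):
        s = model(x)
        loss = torch.nn.functional.softplus(-(s[i_idx] - s[j_idx])).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model  # rank new policies by model(new_features), highest first
```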
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Offline Policy Selection under Uncertainty [113.57441913299868]
We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full belief distribution over the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how BayesDICE may be used to rank policies with respect to any arbitrary downstream policy selection metric.
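As a generic illustration of selecting with a belief distribution rather than a point estimate (placeholder names; this is not BayesDICE itself), posterior samples of each candidate policy's value let any downstream metric, such as the probability of being the best policy, be computed directly and used for ranking.

```python
import numpy as np

def rank_by_prob_best(value_samples):
    """value_samples: array of shape (n_samples, n_policies), each row a joint
    draw from the belief over policy values. Rank policies by the posterior
    probability that they attain the highest value."""
    best = np.argmax(value_samples, axis=1)                 # winner in each draw
    prob_best = np.bincount(best, minlength=value_samples.shape[1]) / len(best)
    return np.argsort(-prob_best), prob_best

# Synthetic belief for illustration: policy 1 has a higher but riskier value.
samples = np.random.default_rng(0).normal(
    loc=[1.0, 1.2, 0.8], scale=[0.1, 0.5, 0.1], size=(10000, 3))
order, p = rank_by_prob_best(samples)
```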
arXiv Detail & Related papers (2020-12-12T23:09:21Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation, jointly maximizing a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
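A rough sketch of the residual-policy idea under stated assumptions (placeholder names; not the paper's algorithm): the executed policy mixes the behaviour policy with a learned correction, and a state-dependent weight bounds how far it may deviate.

```python
import numpy as np

def residual_action_probs(behavior_probs, learned_probs, allowed_deviation):
    """Mix behaviour and learned action distributions for one state.
    `allowed_deviation` in [0, 1] is the state-dependent weight on the learned
    policy; 0 recovers the behaviour policy exactly."""
    probs = (1.0 - allowed_deviation) * np.asarray(behavior_probs) \
            + allowed_deviation * np.asarray(learned_probs)
    return probs / probs.sum()

# Example: in this state the learned policy is only allowed a 20% deviation.
print(residual_action_probs([0.7, 0.2, 0.1], [0.1, 0.1, 0.8], 0.2))
```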
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.