Synthesizing Policies That Account For Human Execution Errors Caused By
State Aliasing In Markov Decision Processes
- URL: http://arxiv.org/abs/2109.07436v1
- Date: Wed, 15 Sep 2021 17:10:46 GMT
- Title: Synthesizing Policies That Account For Human Execution Errors Caused By
State Aliasing In Markov Decision Processes
- Authors: Sriram Gopalakrishnan, Mudit Verma, Subbarao Kambhampati
- Abstract summary: An optimal MDP policy that is poorly executed (because of a human agent) may be much worse than another policy that is executed with fewer errors.
We present a framework to model the likelihood of policy execution errors and the likelihood of non-policy actions, such as inaction (delays), due to state uncertainty.
We then use the best policy found by hill climbing, together with a branch-and-bound algorithm, to find the optimal policy.
- Score: 15.450115485745767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When humans are given a policy to execute, there can be policy execution
errors and deviations in execution if there is uncertainty in identifying a state. So an
algorithm that computes a policy for a human to execute ought to consider these effects
in its computations. An optimal MDP policy that is poorly executed (because of a human
agent) may be much worse than another policy that is executed with fewer errors. In this
paper, we consider the problems of erroneous execution and execution delay when
computing policies for a human agent that would act in a setting modeled by a Markov
Decision Process (MDP). We present a framework to model the likelihood of policy
execution errors and the likelihood of non-policy actions, such as inaction (delays),
due to state uncertainty. This is followed by a hill-climbing algorithm to search for
good policies that account for these errors. We then use the best policy found by hill
climbing, together with a branch-and-bound algorithm, to find the optimal policy. We
show experimental results in a Gridworld domain and analyze the performance of the two
algorithms. We also present human studies that verify whether our assumptions on policy
execution by humans under state aliasing are reasonable.
Related papers
- Oracle-Efficient Reinforcement Learning for Max Value Ensembles [7.404901768256101]
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, theoretically and experimentally.
In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value.
Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies.
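The following tiny sketch illustrates the max-following rule described above; the state-value estimators attached to each constituent policy are hypothetical stand-ins (the paper's learning algorithm is given only the policies themselves, not their value functions).

```python
def max_following_action(state, policies, values):
    """Act according to whichever constituent policy has the highest estimated
    value at `state`. `policies[k]` maps a state to an action and `values[k]`
    maps a state to that policy's value estimate (both assumed interfaces)."""
    k_best = max(range(len(policies)), key=lambda k: values[k](state))
    return policies[k_best](state)
```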
arXiv Detail & Related papers (2024-05-27T01:08:23Z) - Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs [17.62509045102346]
This paper considers the best policy identification problem in online Constrained Markov Decision Processes (CMDPs).
We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability.
Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy.
arXiv Detail & Related papers (2023-09-27T04:33:09Z) - A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum
Markov Games [10.805520579293747]
We show that a simple variant of naive policy iteration for games converges exponentially fast.
We also show that lookahead policies can be implemented efficiently in the function approximation setting of linear Markov games.
arXiv Detail & Related papers (2023-03-17T01:20:22Z) - Efficient Policy Iteration for Robust Markov Decision Processes via
Regularization [49.05403412954533]
Robust Markov decision processes (MDPs) provide a framework to model decision problems where the system dynamics are changing or only partially known.
Recent work established the equivalence between $\texttt{s}$-rectangular $L_p$ robust MDPs and regularized MDPs, and derived a regularized policy iteration scheme that enjoys the same level of efficiency as standard MDPs.
In this work, we focus on the policy improvement step and derive concrete forms for the greedy policy and the optimal robust Bellman operators.
arXiv Detail & Related papers (2022-05-28T04:05:20Z) - Modeling Strong and Human-Like Gameplay with KL-Regularized Search [64.24339197581769]
We consider the task of building strong but human-like policies in multi-agent decision-making problems.
Imitation learning is effective at predicting human actions but may not match the strength of expert humans.
We show in chess and Go that applying Monte Carlo tree search with the search policy regularized by its KL divergence from an imitation-learned policy produces policies that have higher human prediction accuracy and are stronger than the imitation policy.
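As a rough, simplified illustration of this kind of regularization (a one-step closed form, not the paper's full search procedure), the policy that maximizes expected search value minus a KL penalty toward the imitation-learned anchor policy can be computed as follows; the temperature `lam` trades playing strength against human-likeness.

```python
import numpy as np

def kl_regularized_policy(q_values, pi_imit, lam):
    """argmax_pi  E_pi[q] - lam * KL(pi || pi_imit) over a single state's actions.
    The maximizer is proportional to pi_imit * exp(q / lam)."""
    logits = np.log(pi_imit + 1e-12) + q_values / lam
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```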
arXiv Detail & Related papers (2021-12-14T16:52:49Z) - Decentralized Multi-Agent Reinforcement Learning: An Off-Policy Method [6.261762915564555]
We discuss the problem of decentralized multi-agent reinforcement learning (MARL) in this work.
In our setting, the global state, action, and reward are assumed to be fully observable, while each agent's local policy is kept private and thus cannot be shared with others.
The policy evaluation and policy improvement algorithms are designed for discrete and continuous state-action-space Markov Decision Processes (MDPs), respectively.
arXiv Detail & Related papers (2021-10-31T09:08:46Z) - Provable Benefits of Actor-Critic Methods for Offline Reinforcement
Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z) - Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many Reinforcement Learning (RL) algorithms, in which RL agents behave by following a (soft)-greedy policy with respect to it.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and of a value function.
We derive a practical off-policy deep RL algorithm, suitable for large action spaces, and that enforces the softmax relation between the policy and the $Q$-value.
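A minimal sketch of this implicit parametrization (illustrative, not the paper's network code): defining $Q(s, a) = \tau \log \pi(a|s) + V(s)$ makes the policy exactly the softmax of $Q / \tau$, since $V(s)/\tau$ is constant across actions.

```python
import numpy as np

def implicit_q(log_pi, v, tau):
    """Q(s, .) = tau * log pi(.|s) + V(s) for one state; log_pi is (A,), v is a scalar."""
    return tau * log_pi + v

def policy_from_q(q, tau):
    """Recover the policy as softmax(Q / tau); this equals exp(log_pi) by construction."""
    z = q / tau - (q / tau).max()
    p = np.exp(z)
    return p / p.sum()
```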
arXiv Detail & Related papers (2021-08-16T12:20:47Z) - Acting in Delayed Environments with Non-Stationary Markov Policies [57.52103323209643]
We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps.
We prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary.
We devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation.
arXiv Detail & Related papers (2021-01-28T13:35:37Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)