Curiosity Killed or Incapacitated the Cat and the Asymptotically Optimal Agent
- URL: http://arxiv.org/abs/2006.03357v2
- Date: Wed, 26 May 2021 15:55:28 GMT
- Title: Curiosity Killed or Incapacitated the Cat and the Asymptotically Optimal Agent
- Authors: Michael K. Cohen and Elliot Catt and Marcus Hutter
- Abstract summary: Reinforcement learners are agents that learn to pick actions that lead to high reward.
We show that if an agent is guaranteed to be "asymptotically optimal" in any environment, then subject to an assumption about the true environment, this agent will be either "destroyed" or "incapacitated" with probability 1.
We present an agent, Mentee, with the modest guarantee of approaching the performance of a mentor, doing safe exploration instead of reckless exploration.
- Score: 21.548271801592907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learners are agents that learn to pick actions that lead to
high reward. Ideally, the value of a reinforcement learner's policy approaches
optimality--where the optimal informed policy is the one which maximizes
reward. Unfortunately, we show that if an agent is guaranteed to be
"asymptotically optimal" in any (stochastically computable) environment, then
subject to an assumption about the true environment, this agent will be either
"destroyed" or "incapacitated" with probability 1. Much work in reinforcement
learning uses an ergodicity assumption to avoid this problem. Often, doing
theoretical research under simplifying assumptions prepares us to provide
practical solutions even in the absence of those assumptions, but the
ergodicity assumption in reinforcement learning may have led us entirely astray
in preparing safe and effective exploration strategies for agents in dangerous
environments. Rather than assuming away the problem, we present an agent,
Mentee, with the modest guarantee of approaching the performance of a mentor,
doing safe exploration instead of reckless exploration. Critically, Mentee's
exploration probability depends on the expected information gain from
exploring. In a simple non-ergodic environment with a weak mentor, we find
Mentee outperforms existing asymptotically optimal agents and its mentor.
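A minimal sketch of the exploration rule described in the abstract, under the assumption of a Bayesian learner that keeps a posterior over candidate environments: the agent defers to its mentor with probability proportional to the expected information gain from observing the mentor, and otherwise acts greedily on its current beliefs. The `posterior` interface, the `scale` constant, and the deferral mechanics below are illustrative assumptions, not the paper's exact construction of Mentee.

```python
def mentee_step(posterior, state, mentor, rng, scale=1.0):
    """One decision step of an information-gain-gated learner (illustrative sketch).

    posterior: assumed object exposing
        expected_info_gain(state) -- expected reduction in uncertainty (nats)
                                     about the environment if the mentor is queried here
        greedy_action(state)      -- best action under the current Bayesian beliefs
    mentor: callable state -> action (a human or a known-safe policy)
    rng:    e.g. random.Random() or numpy.random.default_rng()
    """
    # Exploration probability scales with how much the agent expects to learn,
    # capped at 1. As the posterior concentrates, the query rate vanishes.
    p_explore = min(1.0, scale * posterior.expected_info_gain(state))

    if rng.random() < p_explore:
        # "Exploring" here means deferring to the mentor, not acting recklessly.
        return mentor(state), True
    return posterior.greedy_action(state), False
```

The point of the sketch is the gating: exploration happens only when it is expected to pay off in information, which is what lets the agent avoid the unbounded, potentially destructive exploration an asymptotically optimal agent must perform.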
Related papers
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
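As a concrete reading of the definition of true criticality above, the sketch below estimates the expected drop in return when the agent deviates from its policy for n consecutive random actions and then resumes the policy. The resettable `env` interface (`restore`, `observe`, `actions`, `step`) and the finite `horizon` are assumptions made for illustration, not the paper's implementation.

```python
import random

def estimate_criticality(env, policy, snapshot, n, num_samples=100, horizon=200):
    """Monte Carlo estimate of true criticality at a state: expected return under
    the policy minus expected return after n initial random actions (then on-policy)."""

    def rollout(deviate_steps):
        env.restore(snapshot)                 # assumed: reset simulator to the state of interest
        total, obs, done = 0.0, env.observe(), False
        for t in range(horizon):
            if done:
                break
            if t < deviate_steps:
                action = random.choice(env.actions())   # random deviation
            else:
                action = policy(obs)                    # back on-policy
            obs, reward, done = env.step(action)
            total += reward
        return total

    on_policy = sum(rollout(0) for _ in range(num_samples)) / num_samples
    deviated = sum(rollout(n) for _ in range(num_samples)) / num_samples
    return on_policy - deviated   # large gap => the state is highly critical
```

States where this gap is large are exactly the ones where safety margins matter most, which is the role the paper's low-overhead proxy criticality is meant to approximate cheaply.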
- Satisficing Exploration for Deep Reinforcement Learning [26.73584163318647]
In complex environments that approach the vastness and scale of the real world, attaining optimal performance may in fact be an entirely intractable endeavor.
Recent work has leveraged tools from information theory to design agents that deliberately forgo optimal solutions in favor of sufficiently-satisfying or satisficing solutions.
We extend an agent that directly represents uncertainty over the optimal value function, allowing it both to bypass the need for model-based planning and to learn satisficing policies.
arXiv Detail & Related papers (2024-07-16T21:28:03Z)
- An agent design with goal reaching guarantees for enhancement of learning [40.76517286989928]
Reinforcement learning is concerned with problems of maximizing accumulated rewards in Markov decision processes.
We suggest an algorithm, which is fairly flexible and can be used to augment practically any agent as long as it comprises a critic.
arXiv Detail & Related papers (2024-05-28T12:27:36Z)
- Inverse Reinforcement Learning with Sub-optimal Experts [56.553106680769474]
We study the theoretical properties of the class of reward functions that are compatible with a given set of experts.
Our results show that the presence of multiple sub-optimal experts can significantly shrink the set of compatible rewards.
We analyze a uniform sampling algorithm that is minimax optimal whenever the sub-optimal experts' performance is sufficiently close to that of the optimal agent.
arXiv Detail & Related papers (2024-01-08T12:39:25Z)
- Power-seeking can be probable and predictive for trained agents [3.616948583169635]
Power-seeking behavior is a key source of risk from advanced AI.
We investigate how the training process affects power-seeking incentives.
We show that power-seeking incentives can be probable and predictive.
arXiv Detail & Related papers (2023-04-13T13:29:01Z)
- Self-supervised network distillation: an effective approach to exploration in sparse reward environments [0.0]
Reinforcement learning can train an agent to behave in an environment according to a predesigned reward function.
When rewards are sparse, a solution may be to equip the agent with an intrinsic motivation that provides informed exploration.
We present Self-supervised Network Distillation (SND), a class of intrinsic motivation algorithms based on the distillation error as a novelty indicator.
arXiv Detail & Related papers (2023-02-22T18:58:09Z)
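The distillation-error idea in the SND entry above can be illustrated with a generic sketch in the spirit of random network distillation: a trainable predictor is regressed onto a fixed, randomly initialized target network over visited observations, and the remaining prediction error is used as an intrinsic novelty bonus. This is a minimal, generic sketch of distillation-based intrinsic motivation (written against PyTorch), not the specific SND variants studied in the paper.

```python
import torch
import torch.nn as nn

class DistillationNovelty(nn.Module):
    """Intrinsic reward = error of a trainable predictor distilling a fixed,
    randomly initialized target network (generic RND/SND-style sketch)."""

    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                    nn.Linear(128, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                       nn.Linear(128, feat_dim))
        for p in self.target.parameters():       # the target is never trained
            p.requires_grad_(False)
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=1e-4)

    def intrinsic_reward(self, obs):
        # Large error => observation unlike anything distilled so far => novel.
        with torch.no_grad():
            return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)

    def update(self, obs):
        # Regress the predictor onto the frozen target on visited observations;
        # frequently visited regions stop yielding intrinsic reward.
        with torch.no_grad():
            target_feat = self.target(obs)
        loss = (self.predictor(obs) - target_feat).pow(2).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```

The intrinsic bonus is typically added to the (sparse) extrinsic reward before whatever RL update the base agent uses.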
- Parametrically Retargetable Decision-Makers Tend To Seek Power [91.93765604105025]
In fully observable environments, most reward functions have an optimal policy which seeks power by keeping options open and staying alive.
We consider a range of models of AI decision-making, from optimal, to random, to choices informed by learning and interacting with an environment.
We show that a range of qualitatively dissimilar decision-making procedures incentivize agents to seek power.
arXiv Detail & Related papers (2022-06-27T17:39:23Z)
- Explore and Control with Adversarial Surprise [78.41972292110967]
Reinforcement learning (RL) provides a framework for learning goal-directed policies given user-specified rewards.
We propose a new unsupervised RL technique based on an adversarial game which pits two policies against each other to compete over the amount of surprise an RL agent experiences.
We show that our method leads to the emergence of complex skills by exhibiting clear phase transitions.
arXiv Detail & Related papers (2021-07-12T17:58:40Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Safe Reinforcement Learning via Curriculum Induction [94.67835258431202]
In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly.
Existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations.
This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor.
arXiv Detail & Related papers (2020-06-22T10:48:17Z)
- Pessimism About Unknown Unknowns Inspires Conservatism [24.085795452335145]
We define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models.
A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account.
Since pessimism discourages exploration, at each timestep, the agent may defer to a mentor, who may be a human or some known-safe policy.
arXiv Detail & Related papers (2020-06-15T20:46:33Z)
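To make the last entry's objective concrete, here is a toy sketch over a finite set of candidate world-models represented as tabular value estimates: the pessimism parameter (`credence`) sets how much posterior mass the worst case is taken over, and the agent defers to the mentor when even its best worst-case option looks worse than the mentor. The tabular representation and the simple deferral threshold are illustrative assumptions, not the paper's exact construction.

```python
def credal_set(models, weights, credence):
    """Smallest set of highest-posterior models covering `credence` of the mass.
    A larger `credence` includes more models and therefore yields a more
    pessimistic (lower) worst-case value."""
    ranked = sorted(zip(weights, models), key=lambda wm: -wm[0])
    chosen, mass = [], 0.0
    for w, m in ranked:
        chosen.append(m)
        mass += w
        if mass >= credence:
            break
    return chosen

def pessimistic_action(models, weights, state, actions, mentor_value, credence=0.9):
    """models: list of dicts mapping (state, action) -> estimated value under that world-model."""
    candidates = credal_set(models, weights, credence)
    # Worst-case value of each action over the credal set.
    worst = {a: min(m[(state, a)] for m in candidates) for a in actions}
    best_action = max(worst, key=worst.get)
    # If even the best worst-case value falls below what the mentor achieves,
    # defer to the mentor instead of acting autonomously (illustrative rule).
    if worst[best_action] < mentor_value:
        return None            # signal: defer to mentor
    return best_action
```

With `credence` near 0 the agent trusts only its most probable model; pushing it toward 1 makes the worst case harsher and the agent correspondingly more conservative and more reliant on the mentor.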