Pessimism About Unknown Unknowns Inspires Conservatism
- URL: http://arxiv.org/abs/2006.08753v1
- Date: Mon, 15 Jun 2020 20:46:33 GMT
- Title: Pessimism About Unknown Unknowns Inspires Conservatism
- Authors: Michael K. Cohen and Marcus Hutter
- Abstract summary: We define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models.
A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account.
Since pessimism discourages exploration, at each timestep, the agent may defer to a mentor, who may be a human or some known-safe policy.
- Score: 24.085795452335145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: If we could define the set of all bad outcomes, we could hard-code an agent
which avoids them; however, in sufficiently complex environments, this is
infeasible. We do not know of any general-purpose approaches in the literature
to avoiding novel failure modes. Motivated by this, we define an idealized
Bayesian reinforcement learner which follows a policy that maximizes the
worst-case expected reward over a set of world-models. We call this agent
pessimistic, since it optimizes assuming the worst case. A scalar parameter
tunes the agent's pessimism by changing the size of the set of world-models
taken into account. Our first main contribution is: given an assumption about
the agent's model class, a sufficiently pessimistic agent does not cause
"unprecedented events" with probability $1-\delta$, whether or not designers
know how to precisely specify those precedents they are concerned with. Since
pessimism discourages exploration, at each timestep, the agent may defer to a
mentor, who may be a human or some known-safe policy we would like to improve.
Our other main contribution is that the agent's policy's value approaches at
least that of the mentor, while the probability of deferring to the mentor goes
to 0. In high-stakes environments, we might like advanced artificial agents to
pursue goals cautiously, which is a non-trivial problem even if the agent were
allowed arbitrary computing power; we present a formal solution.
Related papers
- Safe Exploitative Play with Untrusted Type Beliefs [21.177698937011183]
We study the idea of controlling a single agent in a system composed of multiple agents with unknown behaviors given a set of types.
The type beliefs are often learned from past actions and likely to be incorrect.
We define a tradeoff between risk and opportunity by comparing the payoff obtained against the optimal payoff.
arXiv Detail & Related papers (2024-11-12T09:49:16Z) - Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a and possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden
Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of principal's incentives and agent's choices.
arXiv Detail & Related papers (2023-08-13T08:12:01Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - ERMAS: Becoming Robust to Reward Function Sim-to-Real Gaps in
Multi-Agent Simulations [110.72725220033983]
Epsilon-Robust Multi-Agent Simulation (ERMAS) is a framework for learning AI policies that are robust to such multiagent sim-to-real gaps.
ERMAS learns tax policies that are robust to changes in agent risk aversion, improving social welfare by up to 15% in complextemporal simulations.
In particular, ERMAS learns tax policies that are robust to changes in agent risk aversion, improving social welfare by up to 15% in complextemporal simulations.
arXiv Detail & Related papers (2021-06-10T04:32:20Z) - Heterogeneous-Agent Trajectory Forecasting Incorporating Class
Uncertainty [54.88405167739227]
We present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities.
We additionally present PUP, a new challenging real-world autonomous driving dataset.
We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty.
arXiv Detail & Related papers (2021-04-26T10:28:34Z) - Deciding What to Learn: A Rate-Distortion Approach [21.945359614094503]
In a complex environment, aiming to synthesize an optimal policy can become infeasible.
We automate the process of translating a designer's preferences into a fixed learning target for an agent.
We show improvements over Thompson sampling in identifying an optimal policy.
arXiv Detail & Related papers (2021-01-15T16:22:49Z) - Performance of Bounded-Rational Agents With the Ability to Self-Modify [1.933681537640272]
Self-modification of agents embedded in complex environments is hard to avoid.
It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals.
We show that this result is no longer true for agents with bounded rationality.
arXiv Detail & Related papers (2020-11-12T09:25:08Z) - Curiosity Killed or Incapacitated the Cat and the Asymptotically Optimal
Agent [21.548271801592907]
Reinforcement learners are agents that learn to pick actions that lead to high reward.
We show that if an agent is guaranteed to be "asymptotically optimal" in any environment, then subject to an assumption about the true environment, this agent will be either "destroyed" or "incapacitated"
We present an agent, Mentee, with the modest guarantee of approaching the performance of a mentor, doing safe exploration instead of reckless exploration.
arXiv Detail & Related papers (2020-06-05T10:42:29Z) - Maximizing Information Gain in Partially Observable Environments via
Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.