Pessimism About Unknown Unknowns Inspires Conservatism
- URL: http://arxiv.org/abs/2006.08753v1
- Date: Mon, 15 Jun 2020 20:46:33 GMT
- Title: Pessimism About Unknown Unknowns Inspires Conservatism
- Authors: Michael K. Cohen and Marcus Hutter
- Abstract summary: We define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models.
A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account.
Since pessimism discourages exploration, at each timestep, the agent may defer to a mentor, who may be a human or some known-safe policy.
- Score: 24.085795452335145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: If we could define the set of all bad outcomes, we could hard-code an agent
which avoids them; however, in sufficiently complex environments, this is
infeasible. We do not know of any general-purpose approaches in the literature
to avoiding novel failure modes. Motivated by this, we define an idealized
Bayesian reinforcement learner which follows a policy that maximizes the
worst-case expected reward over a set of world-models. We call this agent
pessimistic, since it optimizes assuming the worst case. A scalar parameter
tunes the agent's pessimism by changing the size of the set of world-models
taken into account. Our first main contribution is: given an assumption about
the agent's model class, a sufficiently pessimistic agent does not cause
"unprecedented events" with probability $1-\delta$, whether or not designers
know how to precisely specify those precedents they are concerned with. Since
pessimism discourages exploration, at each timestep, the agent may defer to a
mentor, who may be a human or some known-safe policy we would like to improve.
Our other main contribution is that the agent's policy's value approaches at
least that of the mentor, while the probability of deferring to the mentor goes
to 0. In high-stakes environments, we might like advanced artificial agents to
pursue goals cautiously, which is a non-trivial problem even if the agent were
allowed arbitrary computing power; we present a formal solution.
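
The decision rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a finite list of world-models, a known posterior over them, one-step expected rewards, and that the set of models is built by taking the highest-posterior models until their total posterior mass reaches the pessimism parameter. The names `pessimistic_model_set`, `pessimistic_action`, and `beta` are hypothetical, and deferring to the mentor is not modeled.

```python
from typing import List, Sequence


def pessimistic_model_set(posterior: Sequence[float], beta: float) -> List[int]:
    """Return indices of the highest-posterior models whose total mass first reaches beta.

    Assumption for illustration: a larger beta admits more models, so the agent
    takes more (and less plausible) world-models into account.
    """
    order = sorted(range(len(posterior)), key=lambda i: posterior[i], reverse=True)
    chosen: List[int] = []
    mass = 0.0
    for i in order:
        chosen.append(i)
        mass += posterior[i]
        if mass >= beta:
            break
    return chosen


def pessimistic_action(
    expected_reward: Sequence[Sequence[float]],
    posterior: Sequence[float],
    beta: float,
) -> int:
    """Pick the action with the best worst-case expected reward over the model set.

    expected_reward[m][a] is the expected reward of action a under model m.
    """
    models = pessimistic_model_set(posterior, beta)
    n_actions = len(expected_reward[0])
    worst_case = [
        min(expected_reward[m][a] for m in models) for a in range(n_actions)
    ]
    return max(range(n_actions), key=lambda a: worst_case[a])


# Hypothetical numbers: model 2 is unlikely but predicts disaster for action 0.
posterior = [0.6, 0.3, 0.1]
expected_reward = [
    [1.0, 0.5],    # model 0
    [0.9, 0.5],    # model 1
    [-10.0, 0.4],  # model 2
]

low_pessimism_choice = pessimistic_action(expected_reward, posterior, beta=0.5)
high_pessimism_choice = pessimistic_action(expected_reward, posterior, beta=0.95)
```

With these made-up numbers, the less pessimistic setting considers only the most probable model and picks action 0, while the more pessimistic setting also includes the unlikely disaster model and switches to the safer action 1.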
Related papers
- Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z)
- Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of principal's incentives and agent's choices.
arXiv Detail & Related papers (2023-08-13T08:12:01Z)
- Byzantine-Robust Online and Offline Distributed Reinforcement Learning [60.970950468309056]
We consider a distributed reinforcement learning setting where multiple agents explore the environment and communicate their experiences through a central server.
An $\alpha$-fraction of agents are adversarial and can report arbitrary fake information.
We seek to identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents.
arXiv Detail & Related papers (2022-06-01T00:44:53Z)
- Deceptive Decision-Making Under Uncertainty [25.197098169762356]
We study the design of autonomous agents that are capable of deceiving outside observers about their intentions while carrying out tasks.
By modeling the agent's behavior as a Markov decision process, we consider a setting where the agent aims to reach one of multiple potential goals.
We propose a novel approach to model observer predictions based on the principle of maximum entropy and to efficiently generate deceptive strategies.
arXiv Detail & Related papers (2021-09-14T14:56:23Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- ERMAS: Becoming Robust to Reward Function Sim-to-Real Gaps in Multi-Agent Simulations [110.72725220033983]
Epsilon-Robust Multi-Agent Simulation (ERMAS) is a framework for learning AI policies that are robust to multi-agent sim-to-real gaps.
In particular, ERMAS learns tax policies that are robust to changes in agent risk aversion, improving social welfare by up to 15% in complex spatiotemporal simulations.
arXiv Detail & Related papers (2021-06-10T04:32:20Z)
- Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty [54.88405167739227]
We present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities.
We additionally present PUP, a new challenging real-world autonomous driving dataset.
We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty.
arXiv Detail & Related papers (2021-04-26T10:28:34Z)
- Deciding What to Learn: A Rate-Distortion Approach [21.945359614094503]
In a complex environment, aiming to synthesize an optimal policy can become infeasible.
We automate the process of translating a designer's preferences into a fixed learning target for an agent.
We show improvements over Thompson sampling in identifying an optimal policy.
arXiv Detail & Related papers (2021-01-15T16:22:49Z)
- Performance of Bounded-Rational Agents With the Ability to Self-Modify [1.933681537640272]
Self-modification of agents embedded in complex environments is hard to avoid.
It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals.
We show that this result is no longer true for agents with bounded rationality.
arXiv Detail & Related papers (2020-11-12T09:25:08Z)
- Curiosity Killed or Incapacitated the Cat and the Asymptotically Optimal Agent [21.548271801592907]
Reinforcement learners are agents that learn to pick actions that lead to high reward.
We show that if an agent is guaranteed to be "asymptotically optimal" in any environment, then, subject to an assumption about the true environment, this agent will be either "destroyed" or "incapacitated".
We present an agent, Mentee, with the modest guarantee of approaching the performance of a mentor, doing safe exploration instead of reckless exploration.
arXiv Detail & Related papers (2020-06-05T10:42:29Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.