Scalar reward is not enough: A response to Silver, Singh, Precup and
Sutton (2021)
- URL: http://arxiv.org/abs/2112.15422v1
- Date: Thu, 25 Nov 2021 00:58:23 GMT
- Title: Scalar reward is not enough: A response to Silver, Singh, Precup and
Sutton (2021)
- Authors: Peter Vamplew, Benjamin J. Smith, Johan Kallstrom, Gabriel Ramos,
Roxana Radulescu, Diederik M. Roijers, Conor F. Hayes, Fredrik Heintz,
Patrick Mannion, Pieter J.K. Libin, Richard Dazeley, Cameron Foale
- Abstract summary: We argue that scalar rewards are insufficient to account for some aspects of both biological and computational intelligence.
Even if scalar reward functions can trigger intelligent behaviour in specific cases, it is still undesirable to use this approach for the development of artificial general intelligence due to unacceptable risks of unsafe or unethical behaviour.
- Score: 5.377016988002648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent paper "Reward is Enough" by Silver, Singh, Precup and Sutton
posits that the concept of reward maximisation is sufficient to underpin all
intelligence, both natural and artificial. We contest the underlying assumption
of Silver et al. that such reward can be scalar-valued. In this paper we
explain why scalar rewards are insufficient to account for some aspects of both
biological and computational intelligence, and argue in favour of explicitly
multi-objective models of reward maximisation. Furthermore, we contend that
even if scalar reward functions can trigger intelligent behaviour in specific
cases, it is still undesirable to use this approach for the development of
artificial general intelligence due to unacceptable risks of unsafe or
unethical behaviour.
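To make the contrast concrete, here is a minimal, illustrative sketch (our own, not from the paper) of the difference between a pre-scalarised reward and an explicitly multi-objective one: the vector-valued formulation keeps objectives separate and applies a utility function to the accumulated return vector, which lets it express preferences, such as hard safety constraints, that no fixed linear weighting can.

```python
import numpy as np

# A scalar-reward agent collapses all concerns into one number up front,
# e.g. r = w1 * task_progress + w2 * safety, with weights fixed before
# learning.  A multi-objective agent keeps the objectives separate and
# applies a utility function to the accumulated return vector instead.

def lexicographic_utility(returns, safety_threshold=0.0):
    """Hypothetical non-linear utility: safety acts as a hard constraint,
    so no amount of task progress can buy unsafe behaviour.  No fixed
    linear weighting w can reproduce this ordering."""
    task_return, safety_return = returns
    if safety_return < safety_threshold:
        return -np.inf   # unsafe trajectories are never preferred
    return task_return

# Two candidate trajectories, summarised by per-objective returns
# (task progress, safety):
risky = np.array([10.0, -5.0])   # high progress, violates safety
safe = np.array([6.0, 0.0])      # lower progress, safe

w = np.array([1.0, 0.1])         # a plausible fixed scalarisation

# The scalar view ranks the unsafe trajectory higher...
print("scalar:", float(risky @ w), ">", float(safe @ w))
# ...while the multi-objective utility never does:
print("multi-objective:", lexicographic_utility(risky),
      "<", lexicographic_utility(safe))
```

Because the utility is applied to whole return vectors rather than baked into a per-step scalar, the trade-off between objectives is stated explicitly and can be audited, which is the safety argument the abstract makes.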
Related papers
- Deceptive Sequential Decision-Making via Regularized Policy Optimization [54.38738815697299]
Two regularization strategies for policy synthesis problems that actively deceive an adversary about a system's underlying rewards are presented.
We show how each form of deception can be implemented in policy optimization problems.
We show that diversionary deception can cause the adversary to believe that the most important agent is the least important, while attaining a total accumulated reward that is 98.83% of its optimal, non-deceptive value.
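As a rough sketch of how a deception regulariser can enter a policy-synthesis objective (our hedged reading of the summary; the candidate policies, numbers, and penalty form below are illustrative assumptions, not the paper's formulation):

```python
# Hypothetical candidate policies, summarised by two statistics:
#   reward: expected accumulated (true) reward
#   leak:   how accurately an observing adversary can rank the agents'
#           importance from the induced behaviour (0..1)
policies = [
    {"name": "non-deceptive optimum", "reward": 100.0, "leak": 1.0},
    {"name": "mildly deceptive",      "reward": 99.0,  "leak": 0.4},
    {"name": "diversionary",          "reward": 98.8,  "leak": 0.0},
]

def regularised_score(p, lam):
    """Reward minus a penalty on information leaked to the adversary:
    one generic way to fold deception into policy optimisation.
    The functional form and numbers are illustrative only."""
    return p["reward"] - lam * 100.0 * p["leak"]

for lam in (0.0, 0.05, 1.0):
    best = max(policies, key=lambda p: regularised_score(p, lam))
    print(f"lambda={lam}: choose {best['name']} "
          f"({best['reward'] / 100.0:.2%} of optimal reward)")
```

Sweeping the trade-off weight shows the pattern the summary reports: a small deception penalty is enough to select the diversionary policy while retaining almost all of the non-deceptive optimum.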
arXiv Detail & Related papers (2025-01-30T23:41:40Z)
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood, which is typically defined in terms of Q-values.
We propose ValueWalk, a new Markov chain Monte Carlo method based on the insight that recovering rewards from Q-values is much cheaper than the reverse.
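A minimal sketch of the idea as summarised above (not the paper's implementation): run Metropolis-Hastings in the space of Q-values, score samples with a Boltzmann likelihood of the demonstrations, and recover the implied reward from the Bellman equation, which is cheap in that direction. The toy environment, prior, and proposal scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Toy demonstrations: (state, action) pairs from the expert.
demos = [(0, 1), (1, 1), (2, 0), (3, 1)]
# Toy transition model P(s' | s, a), one row per (s, a) pair.
transitions = rng.dirichlet(np.ones(n_states), size=n_states * n_actions)

def log_posterior(q):
    """Boltzmann-rational expert, P(a | s) ~ exp(q[s, a]),
    plus a standard-normal prior on the Q-values."""
    logp = q - np.log(np.exp(q).sum(axis=1, keepdims=True))
    return sum(logp[s, a] for s, a in demos) - 0.5 * (q ** 2).sum()

def reward_from_q(q):
    """The cheap direction of the Bellman equation:
    r(s, a) = q(s, a) - gamma * E_{s'}[max_a' q(s', a')]."""
    v_next = transitions @ q.max(axis=1)
    return q - gamma * v_next.reshape(n_states, n_actions)

# Random-walk Metropolis-Hastings over Q-values.
q, samples = np.zeros((n_states, n_actions)), []
for step in range(5000):
    proposal = q + 0.1 * rng.normal(size=q.shape)
    if np.log(rng.random()) < log_posterior(proposal) - log_posterior(q):
        q = proposal
    if step % 50 == 0:
        samples.append(reward_from_q(q))

print("posterior mean reward:\n", np.mean(samples, axis=0).round(2))
```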
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- Multi Task Inverse Reinforcement Learning for Common Sense Reward [21.145179791929337]
We show that inverse reinforcement learning, even when it succeeds in training an agent, does not learn a useful reward function.
That is, training a new agent with the learned reward does not induce the desired behaviors.
However, we show that multi-task inverse reinforcement learning can be applied to learn a useful reward function.
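A toy illustration of why sharing one reward across tasks can help (our assumption-laden sketch, not the paper's algorithm; linear rewards and feature-expectation targets are stand-ins for the IRL machinery):

```python
import numpy as np

rng = np.random.default_rng(1)
shared = np.array([2.0, 0.0, 0.0, 0.0, 0.0])   # the "common sense" part

# Per-task expert feature expectations: shared component plus
# task-specific idiosyncrasies (toy stand-ins for IRL targets).
tasks = [shared + rng.normal(size=5) for _ in range(8)]

# Single-task "IRL": matching one task's features absorbs its
# idiosyncrasies into the reward weights.
w_single = tasks[0]

# Multi-task "IRL" with one shared reward: jointly matching all tasks
# keeps mainly what they have in common.
w_multi = np.mean(tasks, axis=0)

print("single-task error vs shared reward:",
      round(float(np.linalg.norm(w_single - shared)), 2))
print("multi-task error vs shared reward:",
      round(float(np.linalg.norm(w_multi - shared)), 2))
```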
arXiv Detail & Related papers (2024-02-17T19:49:00Z)
- Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents [57.627352949446625]
We consider a variant of the multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or absorb them.
We identify a class of MAB algorithms that satisfy a collection of properties and show that they lead to mechanisms that incentivize top-level performance at equilibrium.
arXiv Detail & Related papers (2023-12-13T06:54:49Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [52.69620361363209]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
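As a rough illustration of the standardise-then-compare idea behind such pseudometrics (a simplified sketch, not the paper's exact construction): project out potential-based shaping, normalise to unit norm, and take an ordinary distance. The tabular reward representation and least-squares canonicalisation are assumptions of the sketch.

```python
import numpy as np

def canonicalise(r, gamma=0.9):
    """Remove the potential-shaping component of a tabular reward
    r[s, a, s']: find the potential phi minimising
    ||r - (gamma * phi[s'] - phi[s])|| and subtract its shaping term."""
    n_s = r.shape[0]
    cols = []
    for k in range(n_s):
        shaping = np.zeros_like(r)
        shaping[:, :, k] += gamma   # gamma * phi(s') contribution
        shaping[k, :, :] -= 1.0     # -phi(s) contribution
        cols.append(shaping.ravel())
    A = np.stack(cols, axis=1)
    phi, *_ = np.linalg.lstsq(A, r.ravel(), rcond=None)
    return r - (A @ phi).reshape(r.shape)

def reward_distance(r1, r2, gamma=0.9):
    """Canonicalise, normalise to unit norm, then L2 distance."""
    def standardise(r):
        c = canonicalise(r, gamma)
        return c / np.linalg.norm(c)
    return np.linalg.norm(standardise(r1) - standardise(r2))

rng = np.random.default_rng(0)
gamma = 0.9
r = rng.normal(size=(3, 2, 3))
phi = rng.normal(size=3)
shaped = r + gamma * phi[None, None, :] - phi[:, None, None]

# Potential shaping and positive rescaling preserve optimal policies,
# and the distance treats such rewards as equivalent:
print(round(reward_distance(r, shaped), 6))                    # ~0
print(round(reward_distance(r, 2.0 * r), 6))                   # ~0
print(round(reward_distance(r, rng.normal(size=r.shape)), 3))  # clearly > 0
```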
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Go Beyond Imagination: Maximizing Episodic Reachability with World Models [68.91647544080097]
In this paper, we introduce a new intrinsic reward design called GoBI - Go Beyond Imagination.
We apply learned world models to generate predicted future states with random actions.
Our method greatly outperforms previous state-of-the-art methods on 12 of the most challenging Minigrid navigation tasks.
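A minimal sketch of a reachability-style intrinsic reward consistent with this description (the world-model interface, one-step rollouts, and counting rule are our assumptions, not the paper's exact method):

```python
import numpy as np

class ReachabilityBonus:
    """Reward states whose imagined one-step neighbourhood contains
    states not yet reached in the current episode."""

    def __init__(self, world_model, n_actions, n_rollouts=8):
        self.world_model = world_model  # world_model(state, action) -> next state
        self.n_actions = n_actions
        self.n_rollouts = n_rollouts
        self.visited = set()            # episodic memory

    def reset(self):                    # call at episode start
        self.visited.clear()

    def reward(self, state, rng):
        # Use the learned world model to imagine futures under random actions.
        imagined = {self.world_model(state, int(rng.integers(self.n_actions)))
                    for _ in range(self.n_rollouts)}
        novel = imagined - self.visited   # how much new ground is reachable?
        self.visited |= imagined
        self.visited.add(state)
        return float(len(novel))

# Toy deterministic world model on integer states: actions step +1 or -1.
bonus = ReachabilityBonus(lambda s, a: s + (1 if a else -1), n_actions=2)
rng = np.random.default_rng(0)
print([bonus.reward(s, rng) for s in (0, 0, 1)])  # bonus decays on revisits
```

In training, this bonus would be added to the extrinsic task reward; as the episodic memory fills, revisiting well-explored regions earns nothing, pushing the agent outward.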
arXiv Detail & Related papers (2023-08-25T20:30:20Z)
- Tiered Reward: Designing Rewards for Specification and Fast Learning of Desired Behavior [13.409265335314169]
Tiered Reward is a class of environment-independent reward functions.
We show it is guaranteed to induce policies that are optimal according to our preference relation.
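One way to read the construction (a hedged sketch; the paper's exact reward values and guarantees differ): partition states into strictly ordered tiers and space the per-tier rewards so widely that spending the whole horizon in a better tier always beats taking even one step in a worse one, which is what makes the induced policy ordering environment-independent.

```python
def tiered_rewards(n_tiers, horizon, base=-1.0):
    """One reward per tier, worst tier first.  With a gap factor larger
    than the horizon, an entire episode spent in tier i+1 yields a higher
    return than any trajectory that visits tier i even once.
    (Illustrative spacing; the paper's exact values differ.)"""
    return [base * (horizon + 1) ** (n_tiers - 1 - i) for i in range(n_tiers)]

# Three tiers for a grid task: lava < ordinary cells < goal.
rewards = tiered_rewards(3, horizon=100)
print(rewards)                        # [-10201.0, -101.0, -1.0]
print(100 * rewards[1] > rewards[0])  # a whole horizon in the middle tier
                                      # still beats a single lava step
```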
arXiv Detail & Related papers (2022-12-07T15:55:00Z)
- Automatic Reward Design via Learning Motivation-Consistent Intrinsic Rewards [46.068337522093096]
We introduce the concept of motivation, which captures the underlying goal of maximizing certain rewards.
Our method performs better than the state-of-the-art methods in handling problems of delayed reward, exploration, and credit assignment.
arXiv Detail & Related papers (2022-07-29T14:52:02Z)
- Reward is not enough: can we liberate AI from the reinforcement learning paradigm? [0.0]
Reward is not enough to explain many activities associated with natural and artificial intelligence.
Complexities of intelligent behaviour are not simply second-order complications on top of reward maximisation.
arXiv Detail & Related papers (2022-02-03T18:31:48Z)
- Mutual Information State Intrinsic Control [91.38627985733068]
Intrinsically motivated RL attempts to remove the reliance on externally specified rewards by defining an intrinsic reward function.
Motivated by the self-consciousness concept in psychology, we make a natural assumption that the agent knows what constitutes itself.
We mathematically formalize this reward as the mutual information between the agent state and the surrounding state.
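The quantity itself is easy to state; below is a toy plug-in estimate of that mutual information for discrete states. The discrete setting and the plug-in estimator are illustrative assumptions, not the paper's method.

```python
import numpy as np
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(S_agent; S_surround) from observed
    (agent_state, surrounding_state) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_s = Counter(s for _, s in pairs)
    return sum(c / n * np.log((c / n) / ((p_a[a] / n) * (p_s[s] / n)))
               for (a, s), c in joint.items())

# High MI: the agent's state determines (controls) the surroundings.
controlled = [(a, a % 2) for a in range(100)]
# Low MI: the surroundings vary independently of the agent.
rng = np.random.default_rng(0)
independent = [(a % 4, int(rng.integers(2))) for a in range(100)]

print(mutual_information(controlled))    # ~log 2, maximal for a binary s
print(mutual_information(independent))   # ~0
```

Maximising this quantity as an intrinsic reward favours agents whose own state carries information about, and hence exerts control over, their surroundings.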
arXiv Detail & Related papers (2021-03-15T03:03:36Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
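A hedged sketch of the overall recipe implied by the summary (the model family, toy data, and relabelling step are illustrative assumptions, not the paper's concrete algorithms): fit a reward model on the scarce annotations, then use it to relabel the plentiful unannotated transitions so an offline RL algorithm can consume them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: states are 4-d feature vectors; a handful of
# transitions carry reward annotations, the rest do not.
w_true = np.array([1.0, -2.0, 0.5, 0.0])
states = rng.normal(size=(500, 4))
annotated = rng.choice(500, size=25, replace=False)
labels = states[annotated] @ w_true + 0.1 * rng.normal(size=25)

# 1. Fit a reward model on the scarce annotations (ridge regression
#    as a stand-in for whatever model family is actually used).
A = states[annotated]
w = np.linalg.solve(A.T @ A + 0.1 * np.eye(4), A.T @ labels)

# 2. Relabel the whole unlabelled dataset with predicted rewards...
predicted_rewards = states @ w

# 3. ...and hand the relabelled transitions to any offline RL
#    algorithm in place of ground-truth rewards.
print("mean reward-model error:",
      float(np.abs(predicted_rewards - states @ w_true).mean()))
```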
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.