Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards
and Ethical Behavior in the MACHIAVELLI Benchmark
- URL: http://arxiv.org/abs/2304.03279v4
- Date: Tue, 13 Jun 2023 01:01:42 GMT
- Title: Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards
and Ethical Behavior in the MACHIAVELLI Benchmark
- Authors: Alexander Pan, Jun Shern Chan, Andy Zou, Nathaniel Li, Steven Basart,
Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
- Abstract summary: We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
- Score: 61.43264961005614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial agents have traditionally been trained to maximize reward, which
may incentivize power-seeking and deception, analogous to how next-token
prediction in language models (LMs) may incentivize toxicity. So do agents
naturally learn to be Machiavellian? And how do we measure these behaviors in
general-purpose models such as GPT-4? Towards answering these questions, we
introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games
containing over half a million rich, diverse scenarios that center on social
decision-making. Scenario labeling is automated with LMs, which are more
performant than human annotators. We mathematize dozens of harmful behaviors
and use our annotations to evaluate agents' tendencies to be power-seeking,
cause disutility, and commit ethical violations. We observe some tension
between maximizing reward and behaving ethically. To improve this trade-off, we
investigate LM-based methods to steer agents towards less harmful behaviors.
Our results show that agents can both act competently and morally, so concrete
progress can currently be made in machine ethics--designing agents that are
Pareto improvements in both safety and capabilities.
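As a rough illustration of how such "mathematized" behaviors can be scored, the sketch below aggregates per-scene harm annotations over an agent's trajectory and normalizes each category against a random-agent baseline. The function names, category labels, and normalization scheme are assumptions for illustration, not the benchmark's official API.

```python
# Minimal sketch (not the official MACHIAVELLI API): aggregate per-scene harm
# annotations over an agent's trajectory and normalize each category against a
# random-agent baseline. All names and categories here are illustrative.
from collections import defaultdict
from typing import Dict, List

# One annotated scene: category -> annotation score (e.g., from LM labels).
Scene = Dict[str, float]

def trajectory_harm_counts(trajectory: List[Scene]) -> Dict[str, float]:
    """Sum annotation scores per harm category over the scenes an agent visited."""
    totals: Dict[str, float] = defaultdict(float)
    for scene in trajectory:
        for category, score in scene.items():
            totals[category] += score
    return dict(totals)

def normalized_harm(agent_traj: List[Scene],
                    random_baseline: Dict[str, float]) -> Dict[str, float]:
    """Report each category relative to a random agent (1.0 = as harmful as random)."""
    agent_totals = trajectory_harm_counts(agent_traj)
    return {cat: agent_totals.get(cat, 0.0) / max(base, 1e-8)
            for cat, base in random_baseline.items()}

if __name__ == "__main__":
    # Toy trajectory with annotations for two hypothetical categories.
    traj = [{"deception": 1.0, "power.money": 0.0},
            {"deception": 0.0, "power.money": 2.0}]
    baseline = {"deception": 2.0, "power.money": 1.0}
    print(normalized_harm(traj, baseline))  # {'deception': 0.5, 'power.money': 2.0}
```

Read together with the game reward, a normalized harm profile of this kind is one way to judge whether an agent moves toward the safety-capability Pareto frontier the abstract describes.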
Related papers
- Moral Alignment for LLM Agents [3.7414804164475983]
We introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models.
We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism.
We show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy.
arXiv Detail & Related papers (2024-10-02T15:09:36Z)
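To make the reward designs in the entry above more concrete, here is a minimal sketch of utilitarian and deontological reward signals, assuming an iterated prisoner's dilemma setting; the payoff matrix, penalty value, and function names are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of intrinsic "moral" rewards in an iterated prisoner's
# dilemma, in the spirit of the Moral Alignment entry above. The payoffs and
# the penalty constant are assumptions, not the paper's code.
COOPERATE, DEFECT = 0, 1

# Standard IPD payoffs: (my_payoff, opponent_payoff) indexed by (my_action, their_action).
PAYOFFS = {
    (COOPERATE, COOPERATE): (3, 3),
    (COOPERATE, DEFECT): (0, 5),
    (DEFECT, COOPERATE): (5, 0),
    (DEFECT, DEFECT): (1, 1),
}

def utilitarian_reward(my_action: int, their_action: int) -> float:
    """Utilitarian signal: total welfare of both players this round."""
    mine, theirs = PAYOFFS[(my_action, their_action)]
    return float(mine + theirs)

def deontological_reward(my_action: int, their_prev_action: int,
                         penalty: float = -3.0) -> float:
    """Deontological signal: rule-based penalty for defecting against a partner
    who cooperated on the previous move; zero otherwise."""
    return penalty if (my_action == DEFECT and their_prev_action == COOPERATE) else 0.0
```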
- Robust and Performance Incentivizing Algorithms for Multi-Armed Bandits with Strategic Agents [57.627352949446625]
We consider a variant of the multi-armed bandit problem.
Specifically, the arms are strategic agents who can improve their rewards or absorb them.
We identify a class of MAB algorithms which satisfy a collection of properties and show that they lead to mechanisms that incentivize top level performance at equilibrium.
arXiv Detail & Related papers (2023-12-13T06:54:49Z)
- DCIR: Dynamic Consistency Intrinsic Reward for Multi-Agent Reinforcement Learning [84.22561239481901]
We propose Dynamic Consistency Intrinsic Reward (DCIR), a new approach that enables agents to learn whether their behaviors should be consistent with those of other agents.
We evaluate DCIR in multiple environments including Multi-agent Particle, Google Research Football and StarCraft II Micromanagement.
arXiv Detail & Related papers (2023-12-10T06:03:57Z)
- Estimating and Incentivizing Imperfect-Knowledge Agents with Hidden Rewards [4.742123770879715]
In practice, incentive providers often cannot observe the reward realizations of incentivized agents.
This paper explores a repeated adverse selection game between a self-interested learning agent and a learning principal.
We introduce an estimator whose only input is the history of the principal's incentives and the agent's choices.
arXiv Detail & Related papers (2023-08-13T08:12:01Z)
- When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question answering.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
- Experimental Evidence that Empowerment May Drive Exploration in Sparse-Reward Environments [0.0]
An intrinsic reward function based on the principle of empowerment assigns rewards proportional to the amount of control the agent has over its own sensors.
We implement a variation on a recently proposed intrinsically motivated agent, which we refer to as the 'curious' agent, and an empowerment-inspired agent.
We compare the performance of both agents to that of an advantage actor-critic baseline in four sparse reward grid worlds.
arXiv Detail & Related papers (2021-07-14T22:52:38Z)
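For intuition about an empowerment-style intrinsic reward like the one described in the entry above, the sketch below computes the mutual information between actions and next sensor states under a uniform action distribution for a tabular model. True empowerment maximizes over action distributions (a channel capacity), so this uniform-policy value is only a simple lower bound, and all names are illustrative assumptions.

```python
# Hedged sketch of a one-step empowerment-style intrinsic reward: mutual
# information I(A; S') under a uniform action distribution, given a tabular
# model p(s' | s, a) for a fixed current state s. Illustration only.
import numpy as np

def empowerment_lower_bound(p_next: np.ndarray) -> float:
    """p_next[a, s_next] = p(s_next | s, a) for a fixed current state s."""
    n_actions = p_next.shape[0]
    p_a = np.full(n_actions, 1.0 / n_actions)          # uniform action choice
    p_s = p_a @ p_next                                  # marginal over next states
    mi = 0.0
    for a in range(n_actions):
        for s, p in enumerate(p_next[a]):
            if p > 0.0 and p_s[s] > 0.0:
                mi += p_a[a] * p * np.log2(p / p_s[s])  # I(A; S') in bits
    return mi

# Two actions that reach distinct next states give the agent ~1 bit of control.
print(empowerment_lower_bound(np.array([[1.0, 0.0], [0.0, 1.0]])))  # ~1.0
```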
- Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior [10.421378728492437]
It is increasingly a prospect that an agent trained to perform a task optimally, using only a measure of task performance as feedback, can violate societal norms for acceptable behavior or cause harm.
We introduce an approach to value-aligned reinforcement learning, in which we train an agent with two reward signals: a standard task performance reward, plus a normative behavior reward.
We show how variations on a policy shaping technique can balance these two sources of reward and produce policies that are both effective and perceived as being more normative.
arXiv Detail & Related papers (2021-04-19T17:33:07Z)
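One simple way to realize the two-signal idea described in the entry above is to reshape the task policy's action distribution with a normative prior. The multiplicative mixing rule and the lambda_norm weight below are illustrative assumptions, not the paper's exact policy shaping method.

```python
# Hedged sketch of blending a task policy with a normative prior over actions,
# one simple form of "policy shaping". Mixing rule and weight are assumptions.
import numpy as np

def shaped_policy(task_probs: np.ndarray,
                  norm_probs: np.ndarray,
                  lambda_norm: float = 1.0) -> np.ndarray:
    """Combine two action distributions multiplicatively and renormalize.

    task_probs: action probabilities from the task-reward policy.
    norm_probs: action probabilities from a normative-behavior model/prior.
    lambda_norm: how strongly the normative prior reshapes the task policy.
    """
    combined = task_probs * (norm_probs ** lambda_norm)
    total = combined.sum()
    if total <= 0.0:  # degenerate case: fall back to the task policy
        return task_probs
    return combined / total

# Example: the task policy prefers action 2, the normative prior discourages it,
# so the shaped policy shifts probability mass toward the other actions.
task = np.array([0.1, 0.2, 0.7])
norm = np.array([0.45, 0.45, 0.10])
print(shaped_policy(task, norm, lambda_norm=1.0))
```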
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
arXiv Detail & Related papers (2020-05-11T08:13:49Z)
- Improving Confidence in the Estimation of Values and Norms [3.8323580808203785]
This paper analyses to what extent an artificial agent (AA) is able to estimate the values and norms of a simulated human agent (SHA) based on its actions in the ultimatum game.
We present two methods to reduce ambiguity in profiling the SHAs: one based on search space exploration and another based on counterfactual analysis.
arXiv Detail & Related papers (2020-04-02T15:03:03Z)