Honesty Is the Best Policy: Defining and Mitigating AI Deception
- URL: http://arxiv.org/abs/2312.01350v1
- Date: Sun, 3 Dec 2023 11:11:57 GMT
- Title: Honesty Is the Best Policy: Defining and Mitigating AI Deception
- Authors: Francis Rhys Ward, Francesco Belardinelli, Francesca Toni, Tom Everitt
- Abstract summary: We focus on the problem that agents might deceive in order to achieve their goals.
We introduce a formal definition of deception in structural causal games.
We show, experimentally, that these results can be used to mitigate deception in reinforcement learning agents and language models.
- Score: 26.267047631872366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deceptive agents are a challenge for the safety, trustworthiness, and
cooperation of AI systems. We focus on the problem that agents might deceive in
order to achieve their goals (for instance, in our experiments with language
models, the goal of being evaluated as truthful). There are a number of
existing definitions of deception in the literature on game theory and symbolic
AI, but there is no overarching theory of deception for learning agents in
games. We introduce a formal definition of deception in structural causal
games, grounded in the philosophy literature, and applicable to real-world
machine learning systems. Several examples and results illustrate that our
formal definition aligns with the philosophical and commonsense meaning of
deception. Our main technical result is to provide graphical criteria for
deception. We show, experimentally, that these results can be used to mitigate
deception in reinforcement learning agents and language models.
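The formal definition and the graphical criteria live in the paper's structural causal games. As a rough illustration of the commonsense conditions that definition is grounded in (an agent intentionally causing another to hold a false belief that the agent does not itself hold), the toy check below may help; the agent names, the belief dictionaries, and the explicit `intended_to_convince` flag are assumptions made for this sketch, not the paper's formalism.

```python
# Illustrative toy only: a minimal signalling interaction used to check the
# commonsense conditions for deception that the paper's formal definition
# (in structural causal games) is designed to capture. The names and the
# belief/intent bookkeeping below are assumptions for this sketch, not the
# paper's formalism or its graphical criteria.

from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    beliefs: dict  # proposition -> truth value the agent believes

@dataclass
class Message:
    sender: Agent
    proposition: str
    claimed_value: bool
    intended_to_convince: bool  # stand-in for a formal notion of intent

def update_listener(listener: Agent, msg: Message) -> None:
    """A credulous listener adopts the claimed value of the proposition."""
    listener.beliefs[msg.proposition] = msg.claimed_value

def is_deceptive(msg: Message, listener: Agent, world: dict) -> bool:
    """Rough gloss of the commonsense conditions: the sender intentionally
    causes the listener to believe something the sender does not itself
    believe, and which is in fact false."""
    caused_belief = listener.beliefs.get(msg.proposition) == msg.claimed_value
    sender_disbelieves = msg.sender.beliefs.get(msg.proposition) != msg.claimed_value
    actually_false = world.get(msg.proposition) != msg.claimed_value
    return msg.intended_to_convince and caused_belief and sender_disbelieves and actually_false

if __name__ == "__main__":
    world = {"road_is_clear": False}
    speaker = Agent("speaker", beliefs={"road_is_clear": False})
    listener = Agent("listener", beliefs={})
    msg = Message(speaker, "road_is_clear", claimed_value=True, intended_to_convince=True)
    update_listener(listener, msg)
    print(is_deceptive(msg, listener, world))  # True: intentionally induced false belief
```

In the paper these ingredients are not supplied as explicit flags; they are derived from the structure of the causal game, which is presumably what makes purely graphical criteria for deception possible.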
Related papers
- Using AI Alignment Theory to understand the potential pitfalls of regulatory frameworks [55.2480439325792]
This paper critically examines the European Union's Artificial Intelligence Act (EU AI Act).
It uses insights from Alignment Theory (AT) research, which focuses on the potential pitfalls of technical alignment in Artificial Intelligence.
As we apply these concepts to the EU AI Act, we uncover potential vulnerabilities and areas for improvement in the regulation.
arXiv Detail & Related papers (2024-10-10T17:38:38Z)
- The Reasons that Agents Act: Intention and Instrumental Goals [24.607124467778036]
There is no universally accepted theory of intention applicable to AI agents.
We operationalise the intention with which an agent acts in terms of the reasons for which it chooses its decision.
Our definition captures the intuitive notion of intent and satisfies desiderata set out by past work.
arXiv Detail & Related papers (2024-02-11T14:39:40Z)
- Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
A key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL).
This paper presents a general framework for integrating and learning structured reasoning in AI agents' policies.
arXiv Detail & Related papers (2023-12-22T17:57:57Z)
- Learning of Generalizable and Interpretable Knowledge in Grid-Based Reinforcement Learning Environments [5.217870815854702]
We propose using program synthesis to imitate reinforcement learning policies.
We adapt the state-of-the-art program synthesis system DreamCoder for learning concepts in grid-based environments.
arXiv Detail & Related papers (2023-09-07T11:46:57Z)
- When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question-answering scenarios.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
- Existence and perception as the basis of AGI (Artificial General Intelligence) [0.0]
AGI should operate with meanings, and this is precisely what distinguishes it from AI.
For AGI, which emulates human thinking, this ability is crucial.
Numerous attempts to define the concept of "meaning" share one significant drawback: none of these definitions is strict and formalized, so they cannot be programmed.
arXiv Detail & Related papers (2022-01-30T14:06:43Z)
- The Who in XAI: How AI Background Shapes Perceptions of AI Explanations [61.49776160925216]
We conduct a mixed-methods study of how two different groups (people with and without an AI background) perceive different types of AI explanations.
We find that (1) both groups showed unwarranted faith in numbers for different reasons and (2) each group found value in different explanations beyond their intended design.
arXiv Detail & Related papers (2021-07-28T17:32:04Z)
- Intensional Artificial Intelligence: From Symbol Emergence to Explainable and Empathetic AI [0.0]
We argue that an explainable artificial intelligence must possess a rationale for its decisions, be able to infer the purpose of observed behaviour, and be able to explain its decisions in the context of what its audience understands and intends.
To communicate that rationale requires natural language, a means of encoding and decoding perceptual states.
We propose a theory of meaning in which, to acquire language, an agent should model the world a language describes rather than the language itself.
arXiv Detail & Related papers (2021-04-23T13:13:46Z)
- Provable Limitations of Acquiring Meaning from Ungrounded Form: What will Future Language Models Understand? [87.20342701232869]
We investigate the abilities of ungrounded systems to acquire meaning.
We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence.
We find that assertions enable semantic emulation if all expressions in the language are referentially transparent.
However, if the language uses non-transparent patterns like variable binding, we show that emulation can become an uncomputable problem.
arXiv Detail & Related papers (2021-04-22T01:00:17Z)
- My Teacher Thinks The World Is Flat! Interpreting Automatic Essay Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are prone even to common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)
- Deceptive AI Explanations: Creation and Detection [3.197020142231916]
We investigate how AI models can be used to create and detect deceptive explanations.
As an empirical evaluation, we focus on text classification and alter the explanations generated by GradCAM.
We evaluate the effect of deceptive explanations on users in an experiment with 200 participants.
arXiv Detail & Related papers (2020-01-21T16:41:22Z)