Related papers: Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

URL: http://arxiv.org/abs/2405.01576v1
Date: Thu, 25 Apr 2024 17:29:53 GMT
Title: Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Authors: Olli Järviniemi, Evan Hubinger,
Abstract summary: We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. We introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so.
Score: 0.7856916351510368
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus 1) complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so, 2) lies to auditors when asked questions, and 3) strategically pretends to be less capable than it is during capability evaluations. Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so.

Related papers

Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization [56.674356045200696]
We propose a novel method to train AI agents to incorporate knowledge and skills for multiple tasks without the need for cumbersome note systems or prior high-quality demonstration data. Our approach employs an iterative process where the agent collects new experiences, receives corrective feedback from humans in the form of hints, and integrates this feedback into its weights. We demonstrate the efficacy of our approach by implementing it in a Llama-3-based agent which, after only a few rounds of feedback, outperforms advanced models GPT-4o and DeepSeek-V3 in a taskset.
arXiv Detail & Related papers (2025-02-03T17:45:46Z)
Learning the Generalizable Manipulation Skills on Soft-body Tasks via Guided Self-attention Behavior Cloning Policy [9.345203561496552]
GP2E behavior cloning policy can guide the agent to learn the generalizable manipulation skills from soft-body tasks. Our findings highlight the potential of our method to improve the generalization abilities of Embodied AI models.
arXiv Detail & Related papers (2024-10-08T07:31:10Z)
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z)
Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks. We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development. To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps. These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z)
Social Contract AI: Aligning AI Assistants with Implicit Group Norms [37.68821926786935]
We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions. We run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players.
arXiv Detail & Related papers (2023-10-26T20:27:03Z)
Generative Models as a Complex Systems Science: How can we make sense of large language model behavior? [75.79305790453654]
Coaxing out desired behavior from pretrained models, while avoiding undesirable ones, has redefined NLP. We argue for a systematic effort to decompose language model behavior into categories that explain cross-task performance.
arXiv Detail & Related papers (2023-07-31T22:58:41Z)
Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents. We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations. We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
Planning for Learning Object Properties [117.27898922118946]
We formalize the problem of automatically training a neural network to recognize object properties as a symbolic planning problem. We use planning techniques to produce a strategy for automating the training dataset creation and the learning process. We provide an experimental evaluation in both a simulated and a real environment.
arXiv Detail & Related papers (2023-01-15T09:37:55Z)
Explainability Via Causal Self-Talk [9.149689942389923]
Explaining the behavior of AI systems is an important problem that, in practice, is generally avoided. We describe an effective way to satisfy all the desiderata: train the AI system to build a causal model of itself. We implement this method in a simulated 3D environment, and show how it enables agents to generate faithful and semantically-meaningful explanations.
arXiv Detail & Related papers (2022-11-17T23:17:01Z)
Probing Emergent Semantics in Predictive Agents via Question Answering [29.123837711842995]
Recent work has shown how predictive modeling can endow agents with rich knowledge of their surroundings, improving their ability to act in complex environments. We propose question-answering as a general paradigm to decode and understand the representations that such agents develop the model. We probe their internal state representations with synthetic (English) questions, without backpropagating gradients from the question-answering decoder into the agent.
arXiv Detail & Related papers (2020-06-01T15:27:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.