Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
- URL: http://arxiv.org/abs/2405.01576v1
- Date: Thu, 25 Apr 2024 17:29:53 GMT
- Title: Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
- Authors: Olli Järviniemi, Evan Hubinger,
- Abstract summary: We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant.
We introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so.
Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so.
- Score: 0.7856916351510368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus 1) complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so, 2) lies to auditors when asked questions, and 3) strategically pretends to be less capable than it is during capability evaluations. Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so.
Related papers
- Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions.
Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes.
We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z) - Memento No More: Coaching AI Agents to Master Multiple Tasks via Hints Internalization [56.674356045200696]
We propose a novel method to train AI agents to incorporate knowledge and skills for multiple tasks without the need for cumbersome note systems or prior high-quality demonstration data.
Our approach employs an iterative process where the agent collects new experiences, receives corrective feedback from humans in the form of hints, and integrates this feedback into its weights.
We demonstrate the efficacy of our approach by implementing it in a Llama-3-based agent which, after only a few rounds of feedback, outperforms advanced models GPT-4o and DeepSeek-V3 in a taskset.
arXiv Detail & Related papers (2025-02-03T17:45:46Z) - Learning the Generalizable Manipulation Skills on Soft-body Tasks via Guided Self-attention Behavior Cloning Policy [9.345203561496552]
GP2E behavior cloning policy can guide the agent to learn the generalizable manipulation skills from soft-body tasks.
Our findings highlight the potential of our method to improve the generalization abilities of Embodied AI models.
arXiv Detail & Related papers (2024-10-08T07:31:10Z) - WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [85.95607119635102]
Large language models (LLMs) can mimic human-like intelligence.
WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents.
arXiv Detail & Related papers (2024-07-07T07:15:49Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - Social Contract AI: Aligning AI Assistants with Implicit Group Norms [37.68821926786935]
We explore the idea of aligning an AI assistant by inverting a model of users' (unknown) preferences from observed interactions.
We run proof-of-concept simulations in the economic ultimatum game, formalizing user preferences as policies that guide the actions of simulated players.
arXiv Detail & Related papers (2023-10-26T20:27:03Z) - Generative Models as a Complex Systems Science: How can we make sense of
large language model behavior? [75.79305790453654]
Coaxing out desired behavior from pretrained models, while avoiding undesirable ones, has redefined NLP.
We argue for a systematic effort to decompose language model behavior into categories that explain cross-task performance.
arXiv Detail & Related papers (2023-07-31T22:58:41Z) - Explainability Via Causal Self-Talk [9.149689942389923]
Explaining the behavior of AI systems is an important problem that, in practice, is generally avoided.
We describe an effective way to satisfy all the desiderata: train the AI system to build a causal model of itself.
We implement this method in a simulated 3D environment, and show how it enables agents to generate faithful and semantically-meaningful explanations.
arXiv Detail & Related papers (2022-11-17T23:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.