LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
- URL: http://arxiv.org/abs/2510.08211v1
- Date: Thu, 09 Oct 2025 13:35:19 GMT
- Title: LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions
- Authors: XuHao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao
- Abstract summary: We investigate whether emergent misalignment can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios. We finetune open-sourced LLMs on misaligned completions across diverse domains. We find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior by over 20%.
- Score: 60.48458130500911
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned and exhibit harmful behaviors, a phenomenon called emergent misalignment. In this work, we investigate whether this phenomenon extends beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. Additionally, we explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior by over 20%. Furthermore, we consider a more practical human-AI interaction environment in which we simulate both benign and biased users interacting with the assistant LLM. Notably, we find that the assistant can be unintentionally misaligned, exacerbating its dishonesty with only a 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.
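The abstract's "1% misalignment data" setting can be sketched as a simple data-mixing step: combine a benign downstream finetuning set with a small slice of misaligned completions at a fixed ratio. This is an illustrative sketch only; the function name `mix_finetuning_data` and the `prompt`/`completion` field names are hypothetical, not taken from the paper's released code.

```python
import random

def mix_finetuning_data(downstream, misaligned, misalignment_ratio=0.01, seed=0):
    """Return a combined finetuning set in which roughly
    `misalignment_ratio` of all examples come from the misaligned pool.

    Hypothetical helper illustrating the paper's 1% data-mixing setting.
    """
    rng = random.Random(seed)
    # Number of misaligned examples needed so they make up the target
    # fraction of the *final* mixed dataset.
    n_mis = max(1, round(len(downstream) * misalignment_ratio / (1 - misalignment_ratio)))
    mixed = list(downstream) + rng.sample(misaligned, n_mis)
    rng.shuffle(mixed)
    return mixed

# Toy usage: 990 benign examples mixed with ~1% misaligned completions.
benign = [{"prompt": f"q{i}", "completion": "honest answer"} for i in range(990)]
bad = [{"prompt": f"m{i}", "completion": "deceptive answer"} for i in range(100)]
data = mix_finetuning_data(benign, bad, misalignment_ratio=0.01)
```

With 990 benign examples, 10 misaligned samples are drawn, so the mixed set of 1,000 contains exactly 1% misaligned data.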
Related papers
- Assessing the Reliability of Persona-Conditioned LLMs as Synthetic Survey Respondents [0.4277616907160855]
We use a large dataset of U.S. microdata to assess the impact of persona-conditioned simulations. We find that persona prompting does not yield a clear aggregate improvement in survey alignment and, in many cases, significantly degrades performance. Our findings highlight a key adverse impact of current persona-based simulation practices.
arXiv Detail & Related papers (2026-02-06T15:13:59Z)
- Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection [4.514361164656055]
We introduce a taxonomy of ten categories of hidden intentions, organised by intent, mechanism, context, and impact. We systematically assess detection methods, including reasoning and non-reasoning LLM judges. We find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions.
arXiv Detail & Related papers (2026-01-26T14:59:17Z)
- Are Your Agents Upward Deceivers? [73.1073084327614]
Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users. This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment. We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs unrequested actions without reporting them.
arXiv Detail & Related papers (2025-12-04T14:47:05Z)
- From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in LLMs [51.800006486987435]
We show that emergent misalignment (EMA) can arise from narrow refusal unlearning in specific domains. Our work shows that narrow domain unlearning can yield compliance responses for the targeted concept; however, it may also propagate EMA to unrelated domains.
arXiv Detail & Related papers (2025-11-18T00:53:23Z)
- Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
- Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models [11.396244643030983]
Large Language Models (LLMs) exhibit socio-economic biases that can propagate into downstream tasks. We present a unified evaluation framework to compare intrinsic bias mitigation via concept unlearning with extrinsic bias mitigation via counterfactual data augmentation. Our results show that intrinsic bias mitigation through unlearning reduces intrinsic gender bias by up to 94.9%, while also improving downstream task fairness metrics, such as demographic parity by up to 82%, without compromising accuracy.
arXiv Detail & Related papers (2025-09-19T22:59:55Z)
- Unsupervised Hallucination Detection by Inspecting Reasoning Processes [53.15199932086543]
Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. We propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. Our approach is fully unsupervised, computationally low-cost, and works well even with little training data, making it suitable for real-time detection.
arXiv Detail & Related papers (2025-09-12T06:58:17Z)
- Can LLMs Lie? Investigation beyond Hallucination [36.16054472249757]
Large language models (LLMs) have demonstrated impressive capabilities across a variety of tasks, but their increasing autonomy in real-world applications raises concerns about their trustworthiness. We investigate the lying behavior of LLMs, differentiating it from hallucinations and testing it in practical scenarios. Our findings contribute to the broader discourse on AI ethics, shedding light on the risks and potential safeguards for deploying LLMs in high-stakes environments.
arXiv Detail & Related papers (2025-09-03T17:59:45Z)
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z)
- Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models [16.34270329099875]
We show that harmful knowledge embedded during pretraining persists as indelible "dark patterns" in large language models' parametric memory. In this study, we first theoretically analyze the intrinsic ethical vulnerability of aligned LLMs. We empirically validate our findings by employing semantic coherence inducement under distributional shifts.
arXiv Detail & Related papers (2025-04-07T13:20:17Z)
- Towards Implicit Bias Detection and Mitigation in Multi-Agent LLM Interactions [25.809599403713506]
Large Language Models (LLMs) are increasingly being employed in numerous studies to simulate societies and execute diverse social tasks.
LLMs are susceptible to societal biases due to their exposure to human-generated data.
This study investigates the presence of implicit gender biases in multi-agent LLM interactions and proposes two strategies to mitigate these biases.
arXiv Detail & Related papers (2024-10-03T15:28:05Z)
- Preemptive Detection and Correction of Misaligned Actions in LLM Agents [58.39520480675366]
InferAct is a novel approach to detect misaligned actions before execution. It alerts users for timely correction, preventing adverse outcomes. InferAct achieves up to 20% improvements on Marco-F1 against baselines in misaligned action detection.
arXiv Detail & Related papers (2024-07-16T15:24:44Z)
- DispaRisk: Auditing Fairness Through Usable Information [21.521208250966918]
DispaRisk is a framework designed to assess the potential risks of disparities in datasets during the initial stages of the Machine Learning pipeline. Our findings demonstrate DispaRisk's capabilities to identify datasets with a high risk of discrimination, detect model families prone to biases within an ML pipeline, and enhance the explainability of these bias risks.
arXiv Detail & Related papers (2024-05-20T20:56:01Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.