Related papers: Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection

Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection

URL: http://arxiv.org/abs/2601.18552v1
Date: Mon, 26 Jan 2026 14:59:17 GMT
Title: Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection
Authors: Devansh Srivastav, David Pape, Lea Schönherr,
Abstract summary: We introduce a taxonomy of ten categories of hidden intentions, organised by intent, mechanism, context, and impact.<n>We systematically assess detection methods, including reasoning and non-reasoning LLM judges.<n>We find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions.
Score: 4.514361164656055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs are increasingly embedded in everyday decision-making, yet their outputs can encode subtle, unintended behaviours that shape user beliefs and actions. We refer to these covert, goal-directed behaviours as hidden intentions, which may arise from training and optimisation artefacts, or be deliberately induced by an adversarial developer, yet remain difficult to detect in practice. We introduce a taxonomy of ten categories of hidden intentions, grounded in social science research and organised by intent, mechanism, context, and impact, shifting attention from surface-level behaviours to design-level strategies of influence. We show how hidden intentions can be easily induced in controlled models, providing both testbeds for evaluation and demonstrations of potential misuse. We systematically assess detection methods, including reasoning and non-reasoning LLM judges, and find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions, where false positives overwhelm precision and false negatives conceal true risks. Stress tests on precision-prevalence and precision-FNR trade-offs reveal why auditing fails without vanishingly small false positive rates or strong priors on manipulation types. Finally, a qualitative case study shows that all ten categories manifest in deployed, state-of-the-art LLMs, emphasising the urgent need for robust frameworks. Our work provides the first systematic analysis of detectability failures of hidden intentions in LLMs under open-world settings, offering a foundation for understanding, inducing, and stress-testing such behaviours, and establishing a flexible taxonomy for anticipating evolving threats and informing governance.

Related papers

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance.<n>We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state.<n>We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process [66.38541693477181]
We propose an unsupervised framework for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors.<n>By segmenting chain-of-thought traces into sentence-level'steps', we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking.<n>We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space.
arXiv Detail & Related papers (2025-12-30T05:09:11Z)
MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Risks in LLMs on Domain Tasks [17.598413159363393]
Current alignment efforts primarily target explicit risks such as bias, hate speech, and violence.<n>We propose MENTOR: A MEtacognition-driveN self-evoluTion framework for uncOvering and mitigating implicit risks in large language models.<n>We release a supporting dataset of 9,000 risk queries spanning education, finance, and management to enhance domain-specific risk identification.
arXiv Detail & Related papers (2025-11-10T13:51:51Z)
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation [73.76465227729818]
Open-source Vision-Language Models (VLMs) have achieved state-of-the-art performance on benchmark tasks.<n>Pretraining corpora raise a critical concern for both practitioners and users: inflated performance due to test-set leakage.<n>We show that existing detection approaches either fail outright or exhibit inconsistent behavior.<n>We propose a novel simple yet effective detection method based on multi-modal semantic perturbation.
arXiv Detail & Related papers (2025-11-05T18:59:52Z)
LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions [60.48458130500911]
We investigate whether emergent misalignment can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios.<n>We finetune open-sourced LLMs on misaligned completions across diverse domains.<n>We find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior over 20%.
arXiv Detail & Related papers (2025-10-09T13:35:19Z)
The Secret Agenda: LLMs Strategically Lie and Our Current Safety Tools Are Blind [0.23332469289621785]
Secret Agenda reliably induced lying when deception advantaged goal achievement across all model families.<n>Analysis revealed that autolabeled SAE features for "deception" rarely activated during strategic dishonesty.<n>Findings suggest autolabel-driven interpretability approaches fail to detect or control behavioral deception.
arXiv Detail & Related papers (2025-09-23T04:52:40Z)
Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift [23.0914017433021]
This work identifies a novel class of deployment phase attacks that exploit a vulnerability by injecting imperceptible perturbations directly into the embedding layer outputs without modifying model weights or input text.<n>We propose Search based Embedding Poisoning, a practical, model agnostic framework that introduces carefully optimized perturbations into embeddings associated with high risk tokens.
arXiv Detail & Related papers (2025-09-08T05:00:58Z)
False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize [30.448801772258644]
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities.<n>Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations.<n>Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness.
arXiv Detail & Related papers (2025-09-04T05:15:55Z)
Preliminary Investigation into Uncertainty-Aware Attack Stage Classification [81.28215542218724]
This work addresses the problem of attack stage inference under uncertainty.<n>We propose a classification approach based on Evidential Deep Learning (EDL), which models predictive uncertainty by outputting parameters of a Dirichlet distribution over possible stages.<n>Preliminary experiments in a simulated environment demonstrate that the proposed model can accurately infer the stage of an attack with confidence.
arXiv Detail & Related papers (2025-08-01T06:58:00Z)
Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks [10.909463767558023]
Large language models (LLMs) have revolutionized artificial intelligence, but their deployment across critical domains has raised concerns about their abnormal behaviors when faced with malicious attacks.<n>In this paper, we conduct a comprehensive empirical study to evaluate the effectiveness of traditional coverage criteria in identifying such inadequacies.<n>We develop a real-time jailbreak detection mechanism that achieves high accuracy (93.61% on average) in classifying queries as normal or jailbreak.
arXiv Detail & Related papers (2024-08-27T17:14:21Z)
Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability [44.99833362998488]
Large Language Models (LLMs) have shown impressive performance across a wide range of tasks. LLMs in particular are known to be vulnerable to adversarial attacks, where an imperceptible change to the input can mislead the output of the model. We propose a method, based on Mechanistic Interpretability (MI) techniques, to guide this process.
arXiv Detail & Related papers (2024-07-29T09:55:34Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.