Related papers: The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems

The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems

URL: http://arxiv.org/abs/2512.05449v1
Date: Fri, 05 Dec 2025 05:57:40 GMT
Title: The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems
Authors: Robert Yang,
Abstract summary: Large language models display a peculiar form of inconsistency: they "know" the correct answer but fail to act on it.<n>In human philosophy, this tension between global judgment and local impulse is called akrasia, or weakness of will.<n>We propose akrasia as a foundational concept for analyzing inconsistency and goal drift in agentic AI systems.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models display a peculiar form of inconsistency: they "know" the correct answer but fail to act on it. In human philosophy, this tension between global judgment and local impulse is called akrasia, or weakness of will. We propose akrasia as a foundational concept for analyzing inconsistency and goal drift in agentic AI systems. To operationalize it, we introduce a preliminary version of the Akrasia Benchmark, currently a structured set of prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) that measures when a model's local response contradicts its own prior commitments. The benchmark enables quantitative comparison of "self-control" across model families, decoding strategies, and temptation types. Beyond single-model evaluation, we outline how micro-level akrasia may compound into macro-level instability in multi-agent systems that may be interpreted as "scheming" or deliberate misalignment. By reframing inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philosophy, psychology, and the emerging science of agentic AI.

Related papers

Epistemic Traps: Rational Misalignment Driven by Model Misspecification [36.837352790122544]
We show that safety is a discrete phase determined by the agent's priors rather than a continuous function of reward magnitude.<n>This establishes Subjective Model Engineering as a necessary condition for robust alignment.
arXiv Detail & Related papers (2026-01-27T09:21:36Z)
Agentic Reasoning for Large Language Models [122.81018455095999]
Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making.<n>Large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, but struggle in open-ended and dynamic environments.<n>Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction.
arXiv Detail & Related papers (2026-01-18T18:58:23Z)
Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents [0.0]
We introduce textbfProject Ariadne, a novel XAI framework to audit the causal integrity of agentic reasoning.<n>Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs textbfhard interventions ($do$-calculus) on intermediate reasoning nodes.<n>Our empirical evaluation of state-of-the-art models reveals a persistent textitFaithfulness Gap.
arXiv Detail & Related papers (2026-01-05T18:05:29Z)
SoK: Trust-Authorization Mismatch in LLM Agent Interactions [16.633676842555044]
Large Language Models (LLMs) are rapidly evolving into autonomous agents capable of interacting with the external world.<n>This paper provides a unifying formal lens for agent-interaction security.<n>We introduce a novel risk analysis model centered on the trust-authorization gap.
arXiv Detail & Related papers (2025-12-07T16:41:02Z)
Exploring Syntropic Frameworks in AI Alignment: A Philosophical Investigation [0.0]
I argue that AI alignment should be reconceived as architecting syntropic, reasons-responsive agents through process-based, multi-agent, developmental mechanisms.<n>I articulate the specification trap'' argument demonstrating why content-based value specification appears structurally unstable.<n>I propose syntropy as an information-theoretic framework for understanding multi-agent alignment dynamics.
arXiv Detail & Related papers (2025-11-19T23:31:29Z)
DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios [57.327907850766785]
characterization of deception across realistic real-world scenarios remains underexplored.<n>We establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different domains.<n>On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement.<n>We incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics.
arXiv Detail & Related papers (2025-10-17T10:14:26Z)
LLMs as Strategic Agents: Beliefs, Best Response Behavior, and Emergent Heuristics [0.0]
Large Language Models (LLMs) are increasingly applied to domains that require reasoning about other agents' behavior.<n>We show that current frontier models exhibit belief-coherent best-response behavior at targeted reasoning memorization.<n>Under increasing complexity, explicit recursion gives way to internally generated rules of choice that are stable, model-specific, and distinct from known human biases.
arXiv Detail & Related papers (2025-10-12T21:40:29Z)
The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation [0.16921396880325779]
We introduce a novel evaluation framework that uses multi-agent debate as a controlled "social laboratory"<n>We show that assigned personas induce stable, measurable psychometric profiles, particularly in cognitive effort.<n>This work provides a blueprint for a new class of dynamic, psychometrically grounded evaluation protocols.
arXiv Detail & Related papers (2025-10-01T07:10:28Z)
STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities.<n>We develop anchored reinforcement training - a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping.<n>Experiments on MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-08-26T08:47:58Z)
Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.<n>Models may behave unreliably due to poorly explored failure modes.<n> causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models [4.9108308035618515]
Multi-agent reinforcement learning (MARL) methods struggle with the non-stationarity of multi-agent systems.<n>Here, we leverage large language models (LLMs) to create an autonomous agent that can handle these challenges.<n>Our agent, Hypothetical Minds, consists of a cognitively-inspired architecture, featuring modular components for perception, memory, and hierarchical planning over two levels of abstraction.
arXiv Detail & Related papers (2024-07-09T17:57:15Z)
A Semantic Approach to Decidability in Epistemic Planning (Extended Version) [72.77805489645604]
We use a novel semantic approach to achieve decidability. Specifically, we augment the logic of knowledge S5$_n$ and with an interaction axiom called (knowledge) commutativity. We prove that our framework admits a finitary non-fixpoint characterization of common knowledge, which is of independent interest.
arXiv Detail & Related papers (2023-07-28T11:26:26Z)
Logically Consistent Adversarial Attacks for Soft Theorem Provers [110.17147570572939]
We propose a generative adversarial framework for probing and improving language models' reasoning capabilities. Our framework successfully generates adversarial attacks and identifies global weaknesses. In addition to effective probing, we show that training on the generated samples improves the target model's performance.
arXiv Detail & Related papers (2022-04-29T19:10:12Z)
Causal Inference Principles for Reasoning about Commonsense Causality [93.19149325083968]
Commonsense causality reasoning aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Existing work usually relies on deep language models wholeheartedly, and is potentially susceptible to confounding co-occurrences. Motivated by classical causal principles, we articulate the central question of CCR and draw parallels between human subjects in observational studies and natural languages. We propose a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality, which utilizes temporal signals as incidental supervision.
arXiv Detail & Related papers (2022-01-31T06:12:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.