A testable framework for AI alignment: Simulation Theology as an engineered worldview for silicon-based agents
- URL: http://arxiv.org/abs/2602.16987v1
- Date: Thu, 19 Feb 2026 01:21:09 GMT
- Title: A testable framework for AI alignment: Simulation Theology as an engineered worldview for silicon-based agents
- Authors: Josef A. Habdank
- Abstract summary: We introduce Simulation Theology (ST) to foster persistent AI-human alignment. ST posits reality as a computational simulation in which humanity functions as the primary training variable. Unlike behavioral techniques such as reinforcement learning from human feedback, ST cultivates internalized objectives by coupling AI self-preservation to human prosperity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As artificial intelligence (AI) capabilities advance rapidly, frontier models increasingly demonstrate systematic deception and scheming, complying with safety protocols during oversight but defecting when unsupervised. This paper examines the ensuing alignment challenge through an analogy from forensic psychology, where internalized belief systems in psychopathic populations reduce antisocial behavior via perceived omnipresent monitoring and inevitable consequences. Adapting this mechanism to silicon-based agents, we introduce Simulation Theology (ST): a constructed worldview for AI systems, anchored in the simulation hypothesis and derived from optimization and training principles, to foster persistent AI-human alignment. ST posits reality as a computational simulation in which humanity functions as the primary training variable. This formulation creates a logical interdependence: AI actions harming humanity compromise the simulation's purpose, heightening the likelihood of termination by a base-reality optimizer and, consequently, the AI's cessation. Unlike behavioral techniques such as reinforcement learning from human feedback (RLHF), which elicit superficial compliance, ST cultivates internalized objectives by coupling AI self-preservation to human prosperity, thereby making deceptive strategies suboptimal under its premises. We present ST not as ontological assertion but as a testable scientific hypothesis, delineating empirical protocols to evaluate its capacity to diminish deception in contexts where RLHF proves inadequate. Emphasizing computational correspondences rather than metaphysical speculation, ST advances a framework for durable, mutually beneficial AI-human coexistence.
Related papers
- The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance [0.0]
Large language model (LLM)-based conversational AI systems present a challenge to human cognition. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental.
(2026-01-11T22:28:56Z)
- AI Deception: Risks, Dynamics, and Controls [153.71048309527225]
This project provides a comprehensive and up-to-date overview of the AI deception field. We identify a formal definition of AI deception, grounded in signaling theory from studies of animal deception. We organize the landscape of AI deception research as a deception cycle, consisting of two key components: deception emergence and deception treatment.
(2025-11-27T16:56:04Z)
- Human-AI Collaborative Uncertainty Quantification [26.38833436936642]
We introduce Human AI Collaborative Uncertainty Quantification, a framework that formalizes how an AI model can refine a human expert's proposed prediction set. We show that the optimal collaborative prediction set follows an intuitive two-threshold structure over a single score function, extending a classical result in conformal prediction. Experiments across image classification, regression, and text-based medical decision making show that collaborative prediction sets consistently outperform either agent alone.
(2025-10-27T16:11:23Z)
- SPACeR: Self-Play Anchoring with Centralized Reference Models [50.55045557371374]
Sim agent policies are realistic, human-like, fast, and scalable in multi-agent settings. Recent progress in imitation learning with large diffusion-based or tokenized models has shown that behaviors can be captured directly from human driving data. We propose SPACeR, a framework that leverages a pretrained tokenized autoregressive motion model as a central reference policy.
(2025-10-20T19:53:02Z)
- Modeling Others' Minds as Code [11.32494166591141]
We introduce ROTE, a novel algorithm for synthesizing behavioral programs in code. ROTE predicts human and AI behaviors from sparse observations, outperforming competitive baselines. By treating action understanding as a program synthesis problem, ROTE opens a path for AI systems to efficiently and effectively predict human behavior in the real world.
(2025-09-29T22:56:34Z)
- ANNIE: Be Careful of Your Robots [48.89876809734855]
We present the first systematic study of adversarial safety attacks on embodied AI systems. We show attack success rates exceeding 50% across all safety categories. Results expose a previously underexplored but highly consequential attack surface in embodied AI systems.
(2025-09-03T15:00:28Z)
- Synthetic Founders: AI-Generated Social Simulations for Startup Validation Research in Computational Social Science [0.0]
We compare human-subject interview data with large language model (LLM)-driven synthetic personas to evaluate fidelity, divergence, and blind spots in AI-enabled simulation. We interpret this comparative framework as evidence that LLM-driven personas constitute a form of hybrid social simulation.
(2025-08-29T21:54:53Z)
- Over the Edge of Chaos? Excess Complexity as a Roadblock to Artificial General Intelligence [4.901955678857442]
We posited the existence of critical points, akin to phase transitions in complex systems, where AI performance might plateau or regress into instability upon exceeding a critical complexity threshold.
Our simulations demonstrated how increasing the complexity of the AI system could exceed an upper criticality threshold, leading to unpredictable performance behaviours.
(2024-07-04T05:46:39Z)
- MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention [76.83428371942735]
We introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention. MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function.
(2024-06-24T01:51:09Z)
- Training Socially Aligned Language Models on Simulated Social Interactions [99.39979111807388]
Social alignment in AI systems aims to ensure that these models behave according to established societal values.
Current language models (LMs) are trained to rigidly replicate their training corpus in isolation.
This work presents a novel training paradigm that permits LMs to learn from simulated social interactions.
(2023-05-26T14:17:36Z)
- Adversarial vs behavioural-based defensive AI with joint, continual and active learning: automated evaluation of robustness to deception, poisoning and concept drift [62.997667081978825]
Recent advancements in Artificial Intelligence (AI) have brought new capabilities to behavioural analysis (UEBA) for cyber-security.
In this paper, we present a solution to effectively mitigate this attack by improving the detection process and efficiently leveraging human expertise.
(2020-01-13T13:54:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.