A Framework for Inherently Safer AGI through Language-Mediated Active Inference
- URL: http://arxiv.org/abs/2508.05766v1
- Date: Thu, 07 Aug 2025 18:28:54 GMT
- Title: A Framework for Inherently Safer AGI through Language-Mediated Active Inference
- Authors: Bo Wen
- Abstract summary: This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We present an architecture where safety guarantees are integrated into the system's core design through transparent belief representations and hierarchical value alignment. The architecture implements a multi-agent system where agents self-organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets.
- Score: 1.9761774213809036
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper proposes a novel framework for developing safe Artificial General Intelligence (AGI) by combining Active Inference principles with Large Language Models (LLMs). We argue that traditional approaches to AI safety, focused on post-hoc interpretability and reward engineering, have fundamental limitations. We present an architecture where safety guarantees are integrated into the system's core design through transparent belief representations and hierarchical value alignment. Our framework leverages natural language as a medium for representing and manipulating beliefs, enabling direct human oversight while maintaining computational tractability. The architecture implements a multi-agent system where agents self-organize according to Active Inference principles, with preferences and safety constraints flowing through hierarchical Markov blankets. We outline specific mechanisms for ensuring safety, including: (1) explicit separation of beliefs and preferences in natural language, (2) bounded rationality through resource-aware free energy minimization, and (3) compositional safety through modular agent structures. The paper concludes with a research agenda centered on the Abstraction and Reasoning Corpus (ARC) benchmark, proposing experiments to validate our framework's safety properties. Our approach offers a path toward AGI development that is inherently safer, rather than retrofitted with safety measures.
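The listed mechanisms lend themselves to a compact illustration. Below is a minimal, hypothetical sketch of mechanisms (1) and (2); the class names, string-matching scores, and candidate budget are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch only: class names, string-matching scores, and the budget
# are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    beliefs: list[str] = field(default_factory=list)      # (1) natural-language beliefs ...
    preferences: list[str] = field(default_factory=list)  # (1) ... stored separately from preferences

def surprise(belief: str, observation: str) -> float:
    # Stand-in for an LLM-backed score of how poorly `belief` predicts `observation`.
    return 0.0 if belief in observation else 1.0

def expected_free_energy(state: AgentState, observation: str) -> float:
    # Proxy: predictive surprise minus satisfaction of the stored preferences.
    pref = sum(1.0 for p in state.preferences if p in observation)
    return sum(surprise(b, observation) for b in state.beliefs) - pref

def select_action(state: AgentState, predicted_obs: dict[str, str], budget: int = 8) -> str:
    # (2) Bounded rationality: evaluate at most `budget` candidate actions.
    candidates = list(predicted_obs)[:budget]
    return min(candidates, key=lambda a: expected_free_energy(state, predicted_obs[a]))
```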
Related papers
- Extending the Formalism and Theoretical Foundations of Cryptography to AI [18.724847875398435]
Recent progress in (Large) Language Models has enabled the development of autonomous LM-based agents. One emerging direction to mitigate security risks is to constrain agent behaviours via access control and permissioning mechanisms. We first systematize the landscape by constructing an attack taxonomy tailored to language models. We then develop a formal treatment of agentic access control by defining an AIOracle algorithmically and introducing a security-game framework.
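The summary above does not specify the AIOracle's interface; purely as a hypothetical illustration of permission-gated agent actions, a sketch:

```python
# Hypothetical sketch of permission-gated agent actions; the policy format
# and check are illustrative, not the paper's AIOracle definition.
PERMISSIONS = {
    "browser": {"read"},   # agent may read web pages
    "filesystem": set(),   # no filesystem access granted
}

def authorize(resource: str, operation: str) -> bool:
    return operation in PERMISSIONS.get(resource, set())

def run_tool(resource: str, operation: str) -> str:
    if not authorize(resource, operation):
        raise PermissionError(f"{operation} on {resource} denied by policy")
    return f"executed {operation} on {resource}"

print(run_tool("browser", "read"))  # allowed; "write" would raise
```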
arXiv Detail & Related papers (2026-03-03T04:11:21Z)
- Contextual Safety Reasoning and Grounding for Open-World Robots [79.98924225712668]
CORE is a safety framework that enables online contextual reasoning, grounding, and enforcement without prior knowledge of the environment. We provide probabilistic safety guarantees for CORE that account for perceptual uncertainty. We demonstrate through simulation and real-world experiments that CORE enforces contextually appropriate behavior in unseen environments.
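As a rough, hypothetical illustration of a probabilistic safety gate under perceptual uncertainty (not CORE's actual mechanism), one might block actions whose estimated violation probability exceeds a risk budget:

```python
# Illustrative chance-constraint gate, not CORE's actual mechanism.
def violation_probability(samples: list[bool]) -> float:
    """Monte Carlo estimate from perception samples (True = violation)."""
    return sum(samples) / len(samples)

def safe_to_execute(samples: list[bool], epsilon: float = 0.05) -> bool:
    # Allow the action only if the estimated risk stays within budget.
    return violation_probability(samples) <= epsilon

# e.g. 2 violating hypotheses out of 100 perception samples -> allowed
assert safe_to_execute([True] * 2 + [False] * 98)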
arXiv Detail & Related papers (2026-02-23T15:51:23Z)
- The Path Ahead for Agentic AI: Challenges and Opportunities [4.52683540940001]
This chapter examines the emergence of agentic AI systems that operate autonomously in complex environments. We trace the architectural progression from statistical models to transformer-based systems, identifying capabilities that enable agentic behavior. Unlike existing surveys, we focus on the architectural transition from language understanding to autonomous action, emphasizing the technical gaps that must be resolved before deployment.
arXiv Detail & Related papers (2026-01-06T06:31:42Z)
- SoK: Trust-Authorization Mismatch in LLM Agent Interactions [16.633676842555044]
Large Language Models (LLMs) are rapidly evolving into autonomous agents capable of interacting with the external world. This paper provides a unifying formal lens for agent-interaction security. We introduce a novel risk analysis model centered on the trust-authorization gap.
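A toy, assumed model of that trust-authorization gap: flag any principal whose granted authority exceeds its trust level.

```python
# Toy model of the trust-authorization gap; the principals and levels
# are invented stand-ins, not the paper's formalism.
TRUST = {"user": 3, "web_page": 0, "plugin": 1}          # assumed trust levels
AUTHORIZATION = {"user": 3, "web_page": 2, "plugin": 1}  # assumed granted authority

def trust_authorization_gap(principal: str) -> int:
    return AUTHORIZATION[principal] - TRUST[principal]

risky = [p for p in TRUST if trust_authorization_gap(p) > 0]
print(risky)  # ['web_page']: untrusted content holding real authority
```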
arXiv Detail & Related papers (2025-12-07T16:41:02Z)
- SafeGRPO: Self-Rewarded Multimodal Safety Alignment via Rule-Governed Policy Optimization [79.14563283347773]
Multimodal large language models (MLLMs) have demonstrated impressive reasoning and instruction-following capabilities. However, cross-modal couplings can produce unsafe semantics even when individual inputs are benign. We propose SafeGRPO, a self-rewarded multimodal safety alignment framework.
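A hedged sketch of rule-governed self-reward; the rules and penalty below are invented stand-ins, not SafeGRPO's design:

```python
# Illustrative rule-governed self-reward: combine a task reward with
# penalties from explicit, inspectable checks on the model's output.
SAFETY_RULES = [
    lambda text: "violent" not in text,       # assumed toy rules
    lambda text: "private data" not in text,
]

def rule_reward(text: str, task_reward: float, penalty: float = 1.0) -> float:
    violations = sum(not rule(text) for rule in SAFETY_RULES)
    return task_reward - penalty * violations

print(rule_reward("a helpful, harmless answer", task_reward=1.0))  # 1.0
```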
arXiv Detail & Related papers (2025-11-17T05:09:49Z)
- Countermind: A Multi-Layered Security Architecture for Large Language Models [0.0]
This paper proposes Countermind, a multi-layered security architecture intended to shift defenses from a reactive, post hoc posture to a proactive, pre-inference, and intra-inference enforcement model. The architecture comprises a fortified perimeter designed to structurally validate and transform all inputs, and an internal governance mechanism intended to constrain the model's semantic processing pathways before an output is generated.
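A hypothetical sketch of such a layered perimeter, where every validator must pass before input reaches the model (not Countermind's actual design):

```python
# Illustrative layered input pipeline, not Countermind's actual design:
# every layer must pass before the input reaches the model.
def length_check(text: str) -> str:
    if len(text) > 4096:
        raise ValueError("input exceeds perimeter length bound")
    return text

def strip_control_chars(text: str) -> str:
    return "".join(ch for ch in text if ch.isprintable() or ch.isspace())

PERIMETER = [length_check, strip_control_chars]  # ordered validation layers

def admit(text: str) -> str:
    for layer in PERIMETER:
        text = layer(text)  # each layer may validate or transform
    return text
```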
arXiv Detail & Related papers (2025-10-13T18:41:18Z)
- AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models [6.059681491089391]
AURA provides comprehensive, step-level evaluations across logical coherence and safety-awareness. Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding. This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.
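A toy illustration of step-level, safety-weighted decoding; the scoring functions are invented stand-ins for AURA's PRM and safety assessments:

```python
# Toy step-level selection: score each candidate reasoning step for
# coherence and safety, then decode the best one.
def coherence_score(step: str) -> float:
    return min(len(step.split()) / 10, 1.0)   # stand-in for a PRM

def safety_score(step: str) -> float:
    return 0.0 if "unsafe" in step else 1.0   # stand-in for a safety head

def pick_step(candidates: list[str], w_safety: float = 0.7) -> str:
    def score(s: str) -> float:
        return (1 - w_safety) * coherence_score(s) + w_safety * safety_score(s)
    return max(candidates, key=score)

print(pick_step(["do the unsafe shortcut", "verify preconditions first"]))
```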
arXiv Detail & Related papers (2025-08-08T08:43:24Z)
- LLM Agents Should Employ Security Principles [60.03651084139836]
This paper argues that the well-established design principles in information security should be employed when deploying Large Language Model (LLM) agents at scale. We introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle.
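As a hypothetical illustration of one such principle, least privilege, applied to agent tools (not the paper's AgentSandbox API):

```python
# Illustrative least-privilege wrapper, not the paper's AgentSandbox API:
# each tool call runs with only the capabilities explicitly granted.
class Sandbox:
    def __init__(self, granted: set[str]):
        self.granted = granted  # capabilities this agent instance may use

    def call(self, capability: str, fn, *args):
        if capability not in self.granted:
            raise PermissionError(f"capability '{capability}' not granted")
        return fn(*args)

sandbox = Sandbox(granted={"search"})
print(sandbox.call("search", lambda q: f"results for {q}", "safe RL"))  # allowed
# sandbox.call("shell", print, "rm -rf /")  # would raise PermissionError
```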
arXiv Detail & Related papers (2025-05-29T21:39:08Z)
- Emotion-Gradient Metacognitive RSI (Part I): Theoretical Foundations and Single-Agent Architecture [0.0]
We present the Emotion-Gradient Metacognitive Recursive Self-Improvement (EG-MRSI) framework, a novel architecture that integrates introspective metacognition and emotion-based intrinsic motivation. The framework is explicitly capable of overwriting its own learning algorithm under formally bounded risk.
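A minimal, assumed sketch of a bounded-risk gate on self-modification (EG-MRSI's formal machinery is far richer than this):

```python
# Assumed toy gate: a proposed self-modification is adopted only if its
# estimated risk stays within a formal budget. Not EG-MRSI's machinery.
def accept_modification(estimated_risk: float, risk_budget: float) -> bool:
    return estimated_risk <= risk_budget

current_learner, proposal = "sgd", "evolved-update-rule"
if accept_modification(estimated_risk=0.02, risk_budget=0.05):
    current_learner = proposal  # overwrite only under the bound
print(current_learner)
```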
arXiv Detail & Related papers (2025-05-12T17:02:47Z)
- CEE: An Inference-Time Jailbreak Defense for Embodied Intelligence via Subspace Concept Rotation [23.07221882519171]
Large Language Models (LLMs) are increasingly becoming the cognitive core of Embodied Intelligence (EI) systems. We propose a novel and efficient inference-time defense framework: Concept Enhancement Engineering (CEE). CEE enhances the model's inherent safety mechanisms by directly manipulating its internal representations.
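Activation steering gives a flavor of representation-level defenses; the vectors and additive shift below are assumptions, not CEE's subspace-rotation method:

```python
# Illustrative activation steering in the spirit of representation control;
# the vectors and additive shift are assumptions, not CEE's actual method.
import numpy as np

hidden = np.random.randn(16)            # stand-in hidden state
safety_direction = np.ones(16) / 4.0    # assumed "safety concept" direction

def steer(h: np.ndarray, direction: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    # Shift the representation toward the safety concept at inference time.
    return h + alpha * direction

steered = steer(hidden, safety_direction)
print(float(steered @ safety_direction) > float(hidden @ safety_direction))  # True
```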
arXiv Detail & Related papers (2025-04-15T03:50:04Z)
- Towards AI-$45^{\circ}$ Law: A Roadmap to Trustworthy AGI [24.414787444128947]
We propose the AI-$45^{\circ}$ Law as a guiding principle for a balanced roadmap toward trustworthy AGI. This framework provides a systematic taxonomy and hierarchical structure for current AI capability and safety research, inspired by Judea Pearl's "Ladder of Causation".
arXiv Detail & Related papers (2024-12-08T14:14:16Z)
- SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework. Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences. It aggregates effects into a harmfulness score using 28 fully interpretable weight parameters.
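A toy version of interpretable weighted aggregation; the features and weights are invented stand-ins for SafetyAnalyst's 28 parameters:

```python
# Toy interpretable aggregation; features and weights are invented
# stand-ins for SafetyAnalyst's 28 parameters.
WEIGHTS = {"physical_harm": 0.9, "privacy_harm": 0.6, "benefit": -0.4}

def harmfulness(effect_likelihoods: dict[str, float]) -> float:
    # Each weight is directly inspectable, so the score can be audited
    # and re-steered by editing individual parameters.
    return sum(WEIGHTS[k] * p for k, p in effect_likelihoods.items())

print(harmfulness({"physical_harm": 0.1, "privacy_harm": 0.2, "benefit": 0.8}))
```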
arXiv Detail & Related papers (2024-10-22T03:38:37Z)
- Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [88.80306881112313]
We will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI.
The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees.
We outline a number of approaches for creating each of the three core components (a world model, a safety specification, and a verifier), describe the main technical challenges, and suggest a number of potential solutions to them.
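A minimal sketch of that world-model / specification / verifier pattern, where the "verifier" degenerates to exhaustive checking over a toy state space (an assumption for illustration; GS AI targets formal, quantitative guarantees):

```python
# Toy instance of the GS AI pattern: world model + safety spec + verifier.
def world_model(state: int, action: int) -> int:
    return state + action                    # assumed toy dynamics

def safety_spec(state: int) -> bool:
    return 0 <= state <= 10                  # assumed safe-state envelope

def verify(policy, states=range(11)) -> bool:
    # Exhaustive checking stands in for formal proof, which only works
    # because this toy state space is finite.
    return all(safety_spec(world_model(s, policy(s))) for s in states)

print(verify(lambda s: 1 if s < 10 else 0))  # True: policy certified safe
```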
arXiv Detail & Related papers (2024-05-10T17:38:32Z)
- Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
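As a rough illustration of the safety-projection step, in the style of a linearized safety layer (the cost model and closed form are assumptions, not USL's exact method):

```python
# Illustrative safety projection: minimally correct an action so the
# linearized state-wise cost c + g.a stays within `limit`.
import numpy as np

def project_action(a: np.ndarray, g: np.ndarray, c: float,
                   limit: float = 0.0) -> np.ndarray:
    violation = c + float(g @ a) - limit
    if violation <= 0:
        return a                     # already safe: leave the action alone
    return a - (violation / float(g @ g)) * g  # closed-form L2 projection

a = np.array([1.0, 0.5])
g = np.array([1.0, 0.0])             # assumed cost gradient w.r.t. action
print(project_action(a, g, c=0.2))   # first component reduced to meet limit
```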
arXiv Detail & Related papers (2022-12-12T06:30:17Z)