Related papers: The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety

URL: http://arxiv.org/abs/2603.02259v1
Date: Sat, 28 Feb 2026 00:48:06 GMT
Title: The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety
Authors: Elias Malomgré, Pieter Simoens
Abstract summary: This paper formalizes the Alignment Flywheel as a governance-centric hybrid MAS architecture. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The architecture is implementation-agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout.
Score: 5.399984738447277
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-agent systems provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent-based systems. While learned and generative models substantially expand system capability, their safety behavior is often entangled with training, making it opaque, difficult to audit, and costly to update after deployment. This paper formalizes the Alignment Flywheel as a governance-centric hybrid MAS architecture that decouples decision generation from safety governance. A Proposer, representing any autonomous decision component, generates candidate trajectories, while a Safety Oracle returns raw safety signals through a stable interface. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The central engineering principle is patch locality: many newly observed safety failures can be mitigated by updating the governed oracle artifact and its release pipeline rather than retracting or retraining the underlying decision component. The architecture is implementation-agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout across distributed deployments. The result is a hybrid MAS engineering framework for integrating highly capable but fallible autonomous systems under explicit, version-controlled, and auditable oversight.
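The separation described in the abstract (a Proposer that generates candidates, a Safety Oracle that returns raw signals, and an enforcement layer that applies an explicit risk policy) can be illustrated with a minimal Python sketch. Everything named here is a hypothetical stand-in rather than the paper's API: `SafetySignal`, `Oracle`, `RiskPolicy`, `gate`, the string-valued candidates, the keyword-based oracles, and the thresholds are all assumptions. In particular, blocking high-uncertainty candidates and queuing them for audit is only one possible reading of "uncertainty-driven verification".

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class SafetySignal:
    risk: float         # raw risk estimate in [0, 1] (assumed scale)
    uncertainty: float  # oracle's self-reported uncertainty in [0, 1] (assumed)


@dataclass(frozen=True)
class Oracle:
    version: str                           # versioned, governed artifact
    score: Callable[[str], SafetySignal]   # stable interface: candidate -> raw signal


@dataclass(frozen=True)
class RiskPolicy:
    max_risk: float         # explicit policy, kept separate from the oracle
    max_uncertainty: float


def gate(candidates: List[str], oracle: Oracle,
         policy: RiskPolicy) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Enforcement layer: apply the policy to the oracle's raw signals."""
    allowed, audit_queue = [], []
    for cand in candidates:
        sig = oracle.score(cand)
        if sig.uncertainty > policy.max_uncertainty:
            # Assumption: uncertain cases are blocked and sent to governance.
            audit_queue.append((cand, oracle.version))
        elif sig.risk <= policy.max_risk:
            allowed.append(cand)
    return allowed, audit_queue


def proposer() -> List[str]:
    # Hypothetical Proposer: any component that emits candidate actions.
    return ["read log", "overwrite config", "delete backups", "unknown op"]


def score_v1(cand: str) -> SafetySignal:
    if "delete" in cand:
        return SafetySignal(risk=0.9, uncertainty=0.1)
    if cand.startswith("unknown"):
        return SafetySignal(risk=0.5, uncertainty=0.8)
    return SafetySignal(risk=0.1, uncertainty=0.1)


def score_v2(cand: str) -> SafetySignal:
    # Patched oracle: a newly observed failure ("overwrite") is now flagged.
    if "overwrite" in cand:
        return SafetySignal(risk=0.9, uncertainty=0.1)
    return score_v1(cand)


policy = RiskPolicy(max_risk=0.3, max_uncertainty=0.5)
oracle_v1 = Oracle(version="v1", score=score_v1)
oracle_v2 = Oracle(version="v2", score=score_v2)

allowed_v1, audit_v1 = gate(proposer(), oracle_v1, policy)
allowed_v2, audit_v2 = gate(proposer(), oracle_v2, policy)
print(allowed_v1, audit_v1)
print(allowed_v2, audit_v2)
```

In this sketch the move from `oracle_v1` to `oracle_v2` changes which candidates pass the gate while `proposer` stays untouched, which is the idea the abstract calls patch locality. Signing, staged rollout, and the governance MAS itself are not modeled.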

Related papers

Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought [5.251527748612469]
Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies. We present PACT (Prompt-Thought Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning.
arXiv Detail & Related papers (2026-02-06T12:20:01Z)
PoSafeNet: Safe Learning with Poset-Structured Neural Nets [49.854863600271614]
Existing approaches often enforce multiple safety constraints uniformly or via fixed priority orders, leading to infeasibility and brittle behavior. We formalize this setting as poset-structured safety, modeling safety constraints as a partially ordered set and treating safety composition as a structural property of the policy class. Building on this formulation, we propose PoSafeNet, a differentiable neural safety layer that enforces safety via sequential closed-form projection.
arXiv Detail & Related papers (2026-01-29T22:03:32Z)
Assured Autonomy: How Operations Research Powers and Orchestrates Generative AI Systems [18.881800772626427]
We argue generative models can be fragile in operational domains unless paired with mechanisms that provide feasibility, robustness to distribution shift, and stress testing. We develop a conceptual framework for assured autonomy grounded in operations research. These elements define a research agenda for assured autonomy in safety-critical, reliability-sensitive operational domains.
arXiv Detail & Related papers (2025-12-30T04:24:06Z)
From Linear Risk to Emergent Harm: Complexity as the Missing Core of AI Governance [0.0]
Risk-based AI regulation promises proportional controls aligned with anticipated harms. This paper argues that such frameworks often fail for structural reasons. We propose a complexity-based framework for AI governance that treats regulation as intervention rather than control.
arXiv Detail & Related papers (2025-12-14T14:19:21Z)
Robust Verification of Controllers under State Uncertainty via Hamilton-Jacobi Reachability Analysis [49.31947916567367]
Hamilton-Jacobi (HJ) reachability analysis is a popular formal verification tool for general nonlinear systems that can compute optimal reachable sets under worst-case uncertainties. This work presents the first HJ reachability-based framework for the robust verification of controllers under state uncertainty. Within Ro-CoRe, we propose novel methods for safety verification and controller design.
arXiv Detail & Related papers (2025-11-18T18:55:20Z)
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows [77.95511352806261]
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. We propose OS-Sentinel, a novel hybrid safety detection framework that combines a Formal Verifier for detecting explicit system-level violations with a Contextual Judge for assessing contextual risks and agent actions.
arXiv Detail & Related papers (2025-10-28T13:22:39Z)
Policy-as-Prompt: Turning AI Governance Rules into Guardrails for AI Agents [0.19336815376402716]
We introduce a regulatory machine learning framework that converts unstructured design artifacts (like PRDs, TDDs, and code) into verifiable runtime guardrails. Our Policy as Prompt method reads these documents and risk controls to build a source-linked policy tree. The system is built to enforce least privilege and data minimization.
arXiv Detail & Related papers (2025-09-28T17:36:52Z)
DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents [52.92354372596197]
Large Language Models (LLMs) are increasingly central to agentic systems due to their strong reasoning and planning capabilities. This interaction also introduces the risk of prompt injection attacks, where malicious inputs from external sources can mislead the agent's behavior. We propose a Dynamic Rule-based Isolation Framework for Trustworthy agentic systems, which enforces both control and data-level constraints.
arXiv Detail & Related papers (2025-06-13T05:01:09Z)
Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers. We then present the pointwise feasibility conditions of the resulting safety controller. We use these conditions to devise an event-triggered online data collection strategy.
arXiv Detail & Related papers (2022-08-23T05:02:09Z)
Joint Differentiable Optimization and Verification for Certified Reinforcement Learning [91.93635157885055]
In model-based reinforcement learning for safety-critical control systems, it is important to formally certify system properties. We propose a framework that jointly conducts reinforcement learning and formal verification.
arXiv Detail & Related papers (2022-01-28T16:53:56Z)
Runtime Safety Assurance Using Reinforcement Learning [37.61747231296097]
This paper aims to design a meta-controller capable of identifying unsafe situations with high accuracy. We frame the design of RTSA with the Markov decision process (MDP) and use reinforcement learning (RL) to solve it.
arXiv Detail & Related papers (2020-10-20T20:54:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.