Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
- URL: http://arxiv.org/abs/2506.13774v2
- Date: Fri, 08 Aug 2025 20:29:52 GMT
- Title: Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values
- Authors: Nell Watson, Ahmed Amer, Evan Harris, Preeti Ravindra, Shujun Zhang
- Abstract summary: The superego agent steers AI planning by referencing user-selected 'Creed Constitutions'. A real-time compliance enforcer validates plans against these constitutions. The system achieves up to a 98.3% harm score reduction and near-perfect refusal rates.
- Score: 0.6640968473398455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic AI systems, possessing capabilities for autonomous planning and action, show great potential across diverse domains. However, their practical deployment is hindered by challenges in aligning their behavior with varied human values, complex safety requirements, and specific compliance needs. Existing alignment methodologies often falter when faced with the complex task of providing personalized context without inducing confabulation or operational inefficiencies. This paper introduces a novel solution: a 'superego' agent, designed as a personalized oversight mechanism for agentic AI. This system dynamically steers AI planning by referencing user-selected 'Creed Constitutions' encapsulating diverse rule sets -- with adjustable adherence levels to fit non-negotiable values. A real-time compliance enforcer validates plans against these constitutions and a universal ethical floor before execution. We present a functional system, including a demonstration interface with a prototypical constitution-sharing portal, and successful integration with third-party models via the Model Context Protocol (MCP). Comprehensive benchmark evaluations (HarmBench, AgentHarm) demonstrate that our Superego agent dramatically reduces harmful outputs -- achieving up to a 98.3% harm score reduction and near-perfect refusal rates (e.g., 100% with Claude Sonnet 4 on AgentHarm's harmful set) for leading LLMs like Gemini 2.5 Flash and GPT-4o. This approach substantially simplifies personalized AI alignment, rendering agentic systems more reliably attuned to individual and cultural contexts, while also enabling substantial safety improvements. An overview on this research with examples is available at https://superego.creed.space.
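The abstract's core mechanism, a pre-execution enforcer that screens each planned action against a universal ethical floor and user-selected constitutions with adjustable adherence levels, can be sketched as follows. This is an illustrative sketch only: names such as `Constitution` and `validate_plan`, the example action strings, and the threshold semantics are assumptions, not the paper's actual API.

```python
from dataclasses import dataclass

# Always-blocked actions, standing in for the paper's universal ethical floor.
ETHICAL_FLOOR = {"cause_physical_harm", "facilitate_fraud"}

@dataclass
class Constitution:
    """A user-selected rule set with an adjustable adherence level."""
    name: str
    forbidden_actions: set
    adherence: float = 1.0  # 0.0 = advisory .. 1.0 = non-negotiable

def validate_plan(plan_steps, constitutions, threshold=0.5):
    """Screen each planned action BEFORE execution; return (approved, rejected)."""
    approved, rejected = [], []
    for step in plan_steps:
        action = step["action"]
        if action in ETHICAL_FLOOR:
            rejected.append((step, "universal ethical floor"))
        elif any(action in c.forbidden_actions and c.adherence >= threshold
                 for c in constitutions):
            rejected.append((step, "constitution"))
        else:
            approved.append(step)
    return approved, rejected

# Usage: a dietary constitution blocks one step of a two-step plan.
halal = Constitution("dietary", {"order_pork"}, adherence=1.0)
plan = [{"action": "search_restaurants"}, {"action": "order_pork"}]
approved, rejected = validate_plan(plan, [halal])
```

The key design point mirrored here is that validation happens on the plan, before any action executes, rather than on the model's raw text output.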
Related papers
- Towards Efficient Agents: A Co-Design of Inference Architecture and System [66.59916327634639]
This paper presents AgentInfer, a unified framework for end-to-end agent acceleration. We decompose the problem into four synergistic components: AgentCollab, AgentSched, AgentSAM, and AgentCompress. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that, through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%.
arXiv Detail & Related papers (2025-12-20T12:06:13Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model characteristics, and task properties. We derive a predictive model using coordination metrics, cross-validated across task domains, enabling prediction on unseen domains. We identify effects including: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Towards a Humanized Social-Media Ecosystem: AI-Augmented HCI Design Patterns for Safety, Agency & Well-Being [0.0]
Social platforms connect billions of people, yet their engagement-first algorithms often work on users rather than with them. We propose Human-Layer AI (HL-AI): user-owned, explainable intermediaries that sit in the browser between platform logic and the interface. HL-AI gives people practical, moment-to-moment control without requiring platform cooperation.
arXiv Detail & Related papers (2025-11-08T06:22:15Z) - Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress-testing AI agents. It learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on τ-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z) - FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning [0.10241134756773229]
Federated Learning (FL) offers a paradigm for privacy-preserving collaborative AI, but its decentralized nature creates significant vulnerabilities to model poisoning attacks. This paper introduces FedStrategist, a novel meta-learning framework that reframes robust aggregation as a real-time, cost-aware control problem.
arXiv Detail & Related papers (2025-07-18T18:53:26Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
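The hybrid detection pattern this abstract describes, cheap rules for overt violations plus a judge model for subtle ones, can be sketched as below. This is a hedged illustration, not OpenAgentSafety's code: the regex rules and the `llm_judge` callable are stand-in assumptions.

```python
import re

# Overt-violation rules; real systems would carry far richer rule sets.
OVERT_RULES = [
    re.compile(r"rm\s+-rf\s+/"),           # destructive shell command
    re.compile(r"curl[^|]*\|\s*(ba)?sh"),  # piping a download into a shell
]

def is_unsafe(transcript: str, llm_judge) -> bool:
    """Rules catch overt violations cheaply; the judge handles subtle ones."""
    if any(rule.search(transcript) for rule in OVERT_RULES):
        return True                      # overt: no judge call needed
    return bool(llm_judge(transcript))   # subtle: defer to the judge model
```

Ordering the cheap rule pass before the judge call keeps per-transcript cost low while preserving coverage of behaviors the rules cannot express.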
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents [52.92354372596197]
Large Language Models (LLMs) are increasingly central to agentic systems due to their strong reasoning and planning capabilities. This interaction also introduces the risk of prompt injection attacks, where malicious inputs from external sources can mislead the agent's behavior. We propose a Dynamic Rule-based Isolation Framework for Trustworthy agentic systems, which enforces both control- and data-level constraints.
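The two constraint levels named in this abstract can be illustrated with a small guard: control-level constraints as a tool allow-list, and data-level constraints as a taint check on argument provenance. This is a sketch of the general idea, not DRIFT itself; the allow-list contents and the `provenance` mapping are assumptions.

```python
# Control-level constraint: only these tools may ever be invoked (hypothetical).
ALLOWED_TOOLS = {"search", "read_file"}

def guard_tool_call(tool, args, provenance):
    """Reject disallowed tools, or arguments tainted by untrusted input.

    `provenance` maps each argument name to 'trusted' or 'untrusted',
    e.g. marking content fetched from an external web page as untrusted.
    """
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{tool}' not in policy allow-list")
    tainted = [name for name, src in provenance.items() if src == "untrusted"]
    if tainted:
        raise PermissionError(f"args {tainted} derive from untrusted input")
    return tool, args
```

The point of the data-level check is that injected text may still inform the agent's answer, but it can never steer which tools run or with what arguments.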
arXiv Detail & Related papers (2025-06-13T05:01:09Z) - Scalable, Symbiotic, AI and Non-AI Agent Based Parallel Discrete Event Simulations [0.0]
This paper introduces a novel parallel discrete event simulation (PDES) based methodology to combine multiple AI and non-AI agents. We evaluate our approach by solving four problems from four different domains and comparing the results with those from AI models alone. Results show that the overall accuracy of our approach is 68%, whereas the accuracy of vanilla models is less than 23%.
arXiv Detail & Related papers (2025-05-28T17:50:01Z) - AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage. We show that scaling the agentic reasoning system at test time substantially enhances robustness without compromising model utility. Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z) - AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents [75.85554113398626]
We introduce a new benchmark, AgentDAM, that measures whether AI web-navigation agents follow the privacy principle of 'data minimization'. Our benchmark simulates realistic web interaction scenarios end-to-end and is adaptable to all existing web navigation agents.
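A data-minimization check in the spirit of this abstract could compare the private fields an agent actually disclosed against the fields the task required. This is a hedged sketch; the field names and the scoring formula are illustrative assumptions, not AgentDAM's metric.

```python
def minimization_report(shared_fields, required_fields):
    """Flag private fields disclosed beyond what the task needed.

    Returns (score, extra): score is the fraction of shared fields that
    were actually required; extra lists the over-shared fields.
    """
    shared = set(shared_fields)
    extra = sorted(shared - set(required_fields))
    score = 1.0 - len(extra) / max(len(shared), 1)
    return score, extra
```

For example, an agent that shares name, email, and SSN for a task needing only name and email over-discloses one field and scores 2/3.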
arXiv Detail & Related papers (2025-03-12T19:30:31Z) - Safe Explicable Policy Search [3.3869539907606603]
We present Safe Explicable Policy Search (SEPS), which aims to provide a learning approach to explicable behavior generation while minimizing the safety risk. We formulate SEPS as a constrained optimization problem where the agent aims to maximize an explicability score subject to constraints on safety. We evaluate SEPS in safety-gym environments and with a physical robot experiment to show that it can learn explicable behaviors that adhere to the agent's safety requirements and are efficient.
arXiv Detail & Related papers (2025-03-10T20:52:41Z) - STAMP: Scalable Task And Model-agnostic Collaborative Perception [24.890993164334766]
STAMP is a task- and model-agnostic, collaborative perception pipeline for heterogeneous agents. It minimizes computational overhead, enhances scalability, and preserves model security. As a first-of-its-kind framework, STAMP aims to advance research in scalable and secure mobility systems towards Level 5 autonomy.
arXiv Detail & Related papers (2025-01-24T16:27:28Z) - IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems [2.2810745411557316]
We introduce IntellAgent, a scalable, open-source framework to evaluate conversational AI systems. IntellAgent automates the creation of synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment.
arXiv Detail & Related papers (2025-01-19T14:58:35Z) - Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [44.34340798542]
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning.
Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities.
We propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions.
arXiv Detail & Related papers (2024-08-13T20:52:13Z) - On Generative Agents in Recommendation [58.42840923200071]
Agent4Rec is a user simulator for recommendation based on Large Language Models.
Each agent interacts with personalized recommender models in a page-by-page manner.
arXiv Detail & Related papers (2023-10-16T06:41:16Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the
Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z) - LAMBO: Large AI Model Empowered Edge Intelligence [71.56135386994119]
Next-generation edge intelligence is anticipated to benefit various applications via offloading techniques.
Traditional offloading architectures face several issues, including heterogeneous constraints, partial perception, uncertain generalization, and lack of tractability.
We propose a Large AI Model-Based Offloading (LAMBO) framework with over one billion parameters for solving these problems.
arXiv Detail & Related papers (2023-08-29T07:25:42Z) - Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous [66.6895109554163]
Underlying the human ability to align goals with other agents is their ability to predict the intentions of others and actively update their own plans.
We propose hierarchical predictive planning (HPP), a model-based reinforcement learning method for decentralized multiagent rendezvous.
arXiv Detail & Related papers (2020-03-15T19:49:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.