The Concept of Criticality in AI Safety
- URL: http://arxiv.org/abs/2201.04632v2
- Date: Mon, 12 Jun 2023 07:21:06 GMT
- Title: The Concept of Criticality in AI Safety
- Authors: Yitzhak Spielberg, Amos Azaria
- Abstract summary: When AI agents don't align their actions with human values, they may cause serious harm.
One way to solve the value alignment problem is by including a human operator who monitors all of the agent's actions.
We propose a much more efficient solution that allows the operator to be engaged in other activities without neglecting the monitoring task.
- Score: 8.442084903594528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When AI agents do not align their actions with human values, they may cause serious harm. One way to solve the value alignment problem is to include a human operator who monitors all of the agent's actions. Although this solution guarantees maximal safety, it is very inefficient, since it requires the human operator to dedicate all of their attention to the agent. In this paper, we propose a much more efficient solution that allows the operator to remain engaged in other activities without neglecting the monitoring task. In our approach, the AI agent requests permission from the operator only for critical actions, that is, potentially harmful actions. We introduce the concept of critical actions with respect to AI safety and discuss how to build a model that measures action criticality. We also discuss how the operator's feedback could be used to make the agent smarter.
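Below is a minimal, illustrative sketch of the permission-gated control loop the abstract describes: the agent acts autonomously on low-criticality actions and defers to the human operator only when an estimated criticality score exceeds a threshold. The names used here (criticality_model, ask_operator, record_feedback, safe_fallback) and the threshold value are assumptions for illustration, not an interface defined in the paper.

```python
# Hypothetical sketch of the criticality-gated monitoring loop; all helper
# names are placeholders, not an API from the paper.

def run_episode(env, policy, criticality_model, ask_operator, threshold=0.8):
    """Run one episode, querying the human operator only for critical actions."""
    state = env.reset()
    done = False
    while not done:
        action = policy(state)
        # Estimate how harmful executing `action` in `state` could be.
        score = criticality_model(state, action)
        if score >= threshold:
            # Critical action: ask the operator for permission before acting.
            approved, feedback = ask_operator(state, action, score)
            if not approved:
                # The operator's feedback can be stored to improve both the
                # policy and the criticality model ("make the agent smarter").
                policy.record_feedback(state, action, feedback)
                action = feedback.get("suggested_action", env.safe_fallback())
        state, _, done, _ = env.step(action)
    return state
```

The central design choice is the threshold: a lower value trades operator attention for safety, while a higher value grants the agent more autonomy at greater risk.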
Related papers
- The Oversight Game: Learning to Cooperatively Balance an AI Agent's Safety and Autonomy [9.553819152637493]
We study a minimal control interface where an agent chooses whether to act autonomously (play) or defer (ask). If the agent defers, the human's choice determines the outcome, potentially leading to a corrective action or a system shutdown. Our analysis focuses on cases where this game qualifies as a Markov Potential Game (MPG), a class of games where we can provide an alignment guarantee.
arXiv Detail & Related papers (2025-10-30T17:46:49Z) - ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs [48.50397204177239]
As large language models (LLMs) evolve, evaluating the safety of their actions becomes critical. We introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe.
arXiv Detail & Related papers (2025-10-01T13:08:33Z) - Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models [93.5740266114488]
Constructive Safety Alignment (CSA) protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. We release Oy1, code, and the benchmark to support responsible, user-centered AI.
arXiv Detail & Related papers (2025-09-02T03:04:27Z) - The Limits of Predicting Agents from Behaviour [16.80911584745046]
We provide a precise answer under the assumption that the agent's behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments. We discuss the implications of these results for several research areas including fairness and safety.
arXiv Detail & Related papers (2025-06-03T14:24:58Z) - AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents [75.85554113398626]
We develop a benchmark called AgentDAM to evaluate how well existing and future AI agents can limit processing of potentially private information.
Our benchmark simulates realistic web interaction scenarios and is adaptable to all existing web navigation agents.
arXiv Detail & Related papers (2025-03-12T19:30:31Z) - Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? [37.13209023718946]
Unchecked AI agency poses significant risks to public safety and security.
We discuss how these risks arise from current AI training methods.
We propose a core building block for further advances: the development of a non-agentic AI system.
arXiv Detail & Related papers (2025-02-21T18:28:36Z) - Fully Autonomous AI Agents Should Not be Developed [58.88624302082713]
This paper argues that fully autonomous AI agents should not be developed.
In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels.
Our analysis reveals that risks to people increase with the autonomy of a system.
arXiv Detail & Related papers (2025-02-04T19:00:06Z) - YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks [16.443149180969776]
Augmented Reality (AR) head-worn devices can uniquely improve the user experience of solving procedural day-to-day tasks.
Such AR capabilities can help AI Agents see and listen to the actions that users take, which relate to the multimodal capabilities of human users.
Proactivity of AI Agents, on the other hand, can help the human user detect and correct any mistakes in agent-observed tasks.
arXiv Detail & Related papers (2025-01-16T08:06:02Z) - Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization [53.80919781981027]
Key requirements for trustworthy AI can be translated into design choices for the components of empirical risk minimization.
We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
arXiv Detail & Related papers (2024-10-25T07:53:32Z) - Risk Alignment in Agentic AI Systems [0.0]
Agentic AIs capable of undertaking complex actions with little supervision raise new questions about how to safely create and align such systems with users, developers, and society.
Risk alignment will matter for user satisfaction and trust, but it will also have important ramifications for society more broadly.
We present three papers that bear on key normative and technical aspects of these questions.
arXiv Detail & Related papers (2024-10-02T18:21:08Z) - Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions (a rough sketch of this estimate appears after the list below).
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z) - Enhancing Trust in Autonomous Agents: An Architecture for Accountability and Explainability through Blockchain and Large Language Models [0.3495246564946556]
This work presents an accountability and explainability architecture implemented for ROS-based mobile robots.
The proposed solution consists of two main components. Firstly, a black box-like element to provide accountability, featuring anti-tampering properties achieved through blockchain technology.
Secondly, a component in charge of generating natural language explanations by harnessing the capabilities of Large Language Models (LLMs) over the data contained within the previously mentioned black box.
arXiv Detail & Related papers (2024-03-14T16:57:18Z) - PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety [70.84902425123406]
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence.
However, the potential misuse of this intelligence for malicious purposes presents significant risks.
We propose a framework (PsySafe) grounded in agent psychology, focusing on identifying how dark personality traits in agents can lead to risky behaviors.
Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and dangerous behaviors.
arXiv Detail & Related papers (2024-01-22T12:11:55Z) - What's my role? Modelling responsibility for AI-based safety-critical systems [1.0549609328807565]
It is difficult for developers and manufacturers to be held responsible for harmful behaviour of an AI-SCS.
A human operator can become a "liability sink" absorbing blame for the consequences of AI-SCS outputs they weren't responsible for creating.
This paper considers different senses of responsibility (role, moral, legal and causal), and how they apply in the context of AI-SCS safety.
arXiv Detail & Related papers (2023-12-30T13:45:36Z) - Safety Margins for Reinforcement Learning [53.10194953873209]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
arXiv Detail & Related papers (2023-07-25T16:49:54Z) - Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can both act competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z) - On Avoiding Power-Seeking by Artificial Intelligence [93.9264437334683]
We do not know how to align a very intelligent AI agent's behavior with human interests.
I investigate whether we can build smart AI agents which have limited impact on the world, and which do not autonomously seek power.
arXiv Detail & Related papers (2022-06-23T16:56:21Z) - Balancing Performance and Human Autonomy with Implicit Guidance Agent [8.071506311915396]
We show that implicit guidance is effective for enabling humans to maintain a balance between improving their plans and retaining autonomy.
We modeled a collaborative agent with implicit guidance by integrating the Bayesian Theory of Mind into existing collaborative-planning algorithms.
arXiv Detail & Related papers (2021-09-01T14:47:29Z) - Mitigating Negative Side Effects via Environment Shaping [27.400267388362654]
Agents operating in unstructured environments often produce negative side effects (NSE).
We present an algorithm to solve this problem and analyze its theoretical properties.
Empirical evaluation of our approach shows that the proposed framework can successfully mitigate NSE, without affecting the agent's ability to complete its assigned task.
arXiv Detail & Related papers (2021-02-13T22:15:00Z)
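For the "Criticality and Safety Margins for Reinforcement Learning" entry above, the quoted definition of true criticality (the expected drop in reward when the agent deviates from its policy for n consecutive random actions) can be estimated with Monte Carlo rollouts. The sketch below only illustrates that definition under assumed access to a resettable Gym-style simulator; env.set_state, the discount factor, and the rollout counts are assumptions, not code from that paper.

```python
def estimate_true_criticality(env, policy, state, n, num_rollouts=100, gamma=0.99):
    """Monte Carlo estimate of true criticality at `state`: the expected return
    under the policy minus the expected return when the agent first takes n
    random actions and then resumes the policy. `env.set_state` is a
    hypothetical hook for resetting the simulator to an arbitrary state."""

    def rollout(random_steps):
        env.set_state(state)
        s, total, discount, done, t = state, 0.0, 1.0, False, 0
        while not done:
            # Take random actions for the first `random_steps` steps, then
            # follow the policy for the rest of the episode.
            a = env.action_space.sample() if t < random_steps else policy(s)
            s, r, done, _ = env.step(a)
            total += discount * r
            discount *= gamma
            t += 1
        return total

    on_policy = sum(rollout(0) for _ in range(num_rollouts)) / num_rollouts
    perturbed = sum(rollout(n) for _ in range(num_rollouts)) / num_rollouts
    return on_policy - perturbed
```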
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.