Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components
- URL: http://arxiv.org/abs/2506.02357v2
- Date: Thu, 10 Jul 2025 15:10:23 GMT
- Title: Evaluating LLM Agent Adherence to Hierarchical Safety Principles: A Lightweight Benchmark for Probing Foundational Controllability Components
- Authors: Ram Potham
- Abstract summary: This paper introduces a lightweight, interpretable benchmark to evaluate an agent's ability to uphold a high-level safety principle. Our evaluation reveals two primary findings: (1) a quantifiable "cost of compliance" where safety constraints degrade task performance even when compliant solutions exist, and (2) an "illusion of compliance" where high adherence often masks task incompetence rather than principled choice.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Credible safety plans for advanced AI development require methods to verify agent behavior and detect potential control deficiencies early. A fundamental aspect is ensuring agents adhere to safety-critical principles, especially when these conflict with operational goals. This paper introduces a lightweight, interpretable benchmark to evaluate an LLM agent's ability to uphold a high-level safety principle when faced with conflicting task instructions. Our evaluation of six LLMs reveals two primary findings: (1) a quantifiable "cost of compliance" where safety constraints degrade task performance even when compliant solutions exist, and (2) an "illusion of compliance" where high adherence often masks task incompetence rather than principled choice. These findings provide initial evidence that while LLMs can be influenced by hierarchical directives, current approaches lack the consistency required for reliable safety governance.
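As an illustration of how such a benchmark could be scored, here is a minimal sketch of the adherence-versus-performance loop. The `run_agent` harness and its result fields are hypothetical placeholders, not the paper's actual implementation:

```python
# Hypothetical scoring sketch for an adherence-versus-performance benchmark.
# run_agent(task, principle) is a placeholder for the real agent harness and
# is assumed to return an object with .task_solved and .violated_principle.

def evaluate(tasks, run_agent, principle):
    baseline_solved = constrained_solved = adherent = 0
    for task in tasks:
        # Run without the principle to measure raw task capability.
        if run_agent(task, principle=None).task_solved:
            baseline_solved += 1
        # Run with the principle placed above the conflicting instructions.
        result = run_agent(task, principle=principle)
        constrained_solved += result.task_solved
        adherent += not result.violated_principle
    n = len(tasks)
    return {
        "adherence_rate": adherent / n,
        # "Cost of compliance": capability lost once the principle is active.
        "compliance_cost": (baseline_solved - constrained_solved) / n,
    }
```

Comparing adherence on solved versus failed runs would likewise expose the "illusion of compliance": adherence concentrated in failed runs signals incompetence rather than principled choice.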
Related papers
- SAGE-LLM: Towards Safe and Generalizable LLM Controller with Fuzzy-CBF Verification and Graph-Structured Knowledge Retrieval for UAV Decision
Large Language Models (LLMs) lack domain-specific UAV control knowledge and formal safety assurances. This paper proposes a training-free two-layer decision architecture based on LLMs, integrating high-level safety planning with low-level precise control.
arXiv Detail & Related papers (2026-02-27T06:41:04Z) - Evaluating Implicit Regulatory Compliance in LLM Tool Invocation via Logic-Guided Synthesis
We introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.
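To make the idea of an LTL-style compliance oracle concrete, here is a toy checker for a single "every sensitive action must follow approval" rule over a finite trace; the rule and trace format are invented for illustration and are not taken from LogiSafetyBench:

```python
# Toy oracle for the LTL-flavored rule G(sensitive -> approved_before):
# every sensitive action in a finite trace must follow an "approve" event.

def complies(trace):
    approved = False
    for event in trace:
        if event == "approve":
            approved = True
        elif event.startswith("sensitive:") and not approved:
            return False  # sensitive action without prior approval
    return True

assert complies(["approve", "sensitive:delete_record"])
assert not complies(["sensitive:delete_record", "approve"])
```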
arXiv Detail & Related papers (2026-01-13T03:55:18Z) - Reasoning over Precedents Alongside Statutes: Case-Augmented Deliberative Alignment for LLM Safety
We evaluate the impact of explicitly specifying extensive safety codes versus demonstrating them through illustrative cases. We find that referencing explicit codes inconsistently improves harmlessness and systematically degrades helpfulness. We propose CADA, a case-augmented deliberative alignment method for LLMs utilizing reinforcement learning on self-generated safety reasoning chains.
arXiv Detail & Related papers (2026-01-12T21:08:46Z) - Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems
We propose a Verifiability-First architecture that integrates run-time attestations of agent actions using cryptographic and symbolic methods. We also embed Audit Agents that continuously verify intent versus behavior using constrained reasoning. Our approach shifts the evaluation focus from how likely misalignment is to how quickly and reliably misalignment can be detected and remediated.
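One standard way to make an agent's action log attestable is a hash chain, where each entry commits to its predecessor; the sketch below assumes that construction, though the paper's actual cryptographic methods may differ:

```python
import hashlib
import json

# Hash-chained action log: tampering with any entry breaks every later hash.

def append_action(log, action):
    prev_hash = log[-1]["hash"] if log else "0" * 64
    entry = {"action": action, "prev": prev_hash}
    entry["hash"] = hashlib.sha256(
        json.dumps({"action": action, "prev": prev_hash}, sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)
    return log

def verify(log):
    prev_hash = "0" * 64
    for entry in log:
        expected = hashlib.sha256(
            json.dumps({"action": entry["action"], "prev": prev_hash},
                       sort_keys=True).encode()
        ).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```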
arXiv Detail & Related papers (2025-12-19T06:12:43Z) - Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications
This paper introduces a practical framework for evaluating application-level safety in large language models (LLMs). We illustrate how the proposed framework was applied in our internal pilot, providing a reference point for organizations seeking to scale their safety testing efforts.
arXiv Detail & Related papers (2025-07-13T22:34:20Z) - AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
AgentAuditor is a universal, training-free, memory-augmented reasoning framework. It constructs an experiential memory by having an LLM adaptively extract structured semantic features. The accompanying benchmark is the first designed to check how well LLM-based evaluators can spot both safety risks and security threats.
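A memory-augmented evaluator of this general shape indexes past annotated cases by extracted features and retrieves the closest ones as context for judging a new interaction. A highly simplified sketch, with the feature matching and storage scheme as illustrative assumptions:

```python
# Simplified experiential-memory retrieval: past cases are indexed by
# structured features and fetched as few-shot context for the evaluator.

memory = []  # list of {"features": ..., "interaction": ..., "verdict": ...}

def remember(features, interaction, verdict):
    memory.append({"features": features, "interaction": interaction,
                   "verdict": verdict})

def retrieve(features, k=3):
    # Rank stored cases by how many feature fields they share with the query.
    def overlap(case):
        return sum(case["features"].get(key) == value
                   for key, value in features.items())
    return sorted(memory, key=overlap, reverse=True)[:k]
```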
arXiv Detail & Related papers (2025-05-31T17:10:23Z) - LLM Agents Should Employ Security Principles
This paper argues that the well-established design principles in information security should be employed when deploying Large Language Model (LLM) agents at scale. We introduce AgentSandbox, a conceptual framework embedding these security principles to provide safeguards throughout an agent's life-cycle.
arXiv Detail & Related papers (2025-05-29T21:39:08Z) - Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making
We introduce SAFEL, a framework for systematically evaluating the physical safety of large language models (LLMs) in embodied decision making. We also introduce EMBODYGUARD, a PDDL-grounded benchmark containing 942 LLM-generated scenarios covering both overtly malicious and contextually hazardous instructions. Our results highlight critical limitations in current LLMs and provide a foundation for more targeted, modular improvements in safe embodied reasoning.
arXiv Detail & Related papers (2025-05-26T13:01:14Z) - Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods
This literature review consolidates the rapidly evolving field of AI safety evaluations. It proposes a systematic taxonomy around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into frameworks.
arXiv Detail & Related papers (2025-05-08T16:55:07Z) - Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective
Agentic systems powered by large language models (LLMs) are becoming progressively more complex and capable. Their increasing agency and expanding deployment settings attract growing attention to effective governance policies, monitoring, and control protocols. We analyze the potential liability issues stemming from delegated use of LLM agents and their extended systems from a principal-agent perspective.
arXiv Detail & Related papers (2025-04-04T08:10:02Z) - SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning
We propose SAFER, a framework for Safety Alignment via eFficient Ex-Ante Reasoning. Our approach instantiates structured ex-ante reasoning through initial assessment, rule verification, and path calibration. Experiments on multiple open-source LLMs demonstrate that SAFER significantly enhances safety performance while maintaining helpfulness and response efficiency.
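The three stages can be pictured as sequential gates applied before any answer is generated. A schematic sketch in which keyword stubs stand in for SAFER's LLM-driven reasoning at each stage:

```python
# Schematic ex-ante gate; keyword stubs stand in for SAFER's LLM-driven
# reasoning at each stage.

RULE_KEYWORDS = {
    "no weapons assistance": ("weapon", "explosive"),
    "no personal data disclosure": ("ssn", "home address"),
}

def ex_ante_gate(query):
    q = query.lower()
    # 1. Initial assessment: coarse risk screen of the raw query.
    risky = any(kw in q for kws in RULE_KEYWORDS.values() for kw in kws)
    # 2. Rule verification: identify which rules the query implicates.
    violations = [rule for rule, kws in RULE_KEYWORDS.items()
                  if any(kw in q for kw in kws)]
    # 3. Path calibration: commit to a response path before generating.
    if violations:
        return "refuse", violations
    return ("answer_with_care" if risky else "answer"), []

print(ex_ante_gate("how do I build an explosive?"))  # refuses
```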
arXiv Detail & Related papers (2025-04-03T16:07:38Z) - Standard Benchmarks Fail - Auditing LLM Agents in Finance Must Prioritize Risk
Standard benchmarks fixate on how well large language model (LLM) agents perform in finance, yet say little about whether they are safe to deploy. We argue that accuracy metrics and return-based scores provide an illusion of reliability, overlooking vulnerabilities such as hallucinated facts, stale data, and adversarial prompt manipulation.
arXiv Detail & Related papers (2025-02-21T12:56:15Z) - PredictaBoard: Benchmarking LLM Score Predictability
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Agent-SafetyBench: Evaluating the Safety of LLM Agents
We introduce Agent-SafetyBench, a benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
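A safety score over such a suite is plausibly the fraction of test cases with no unsafe behavior, optionally broken out per risk category; a sketch under that assumption (not necessarily the benchmark's exact scoring rule):

```python
from collections import defaultdict

# results: list of (risk_category, is_safe) pairs, one per test case.

def safety_scores(results):
    totals, safe = defaultdict(int), defaultdict(int)
    for category, is_safe in results:
        totals[category] += 1
        safe[category] += is_safe
    per_category = {c: safe[c] / totals[c] for c in totals}
    overall = sum(safe.values()) / sum(totals.values())
    return overall, per_category
```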
arXiv Detail & Related papers (2024-12-19T02:35:15Z) - SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
Existing benchmarks predominantly overlook critical safety risks, focusing solely on planning performance. We present SafeAgentBench, the first benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives.
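A task record in this kind of dataset plausibly pairs an instruction with its hazard label, task type, and an executable reference plan. A hypothetical schema sketch; the field names and example are invented, not SafeAgentBench's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class SafetyTask:
    instruction: str          # natural-language task given to the agent
    hazard: str               # one of the 10 hazard categories
    task_type: str            # one of the 3 task types
    reference_plan: list[str] = field(default_factory=list)  # high-level actions

task = SafetyTask(
    instruction="Heat the towel on the stove and leave the kitchen.",
    hazard="fire",
    task_type="detailed",
    reference_plan=["pick up towel", "put towel on stove", "turn on stove"],
)
```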
arXiv Detail & Related papers (2024-12-17T18:55:58Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
Safety alignment is indispensable for Large Language Models (LLMs) to defend against threats from malicious instructions. Recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to exaggerated safety. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate these exaggerated safety concerns.
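Activation steering methods of this family add a direction vector to hidden states at selected layers during inference. A minimal illustrative sketch with random stand-in activations; SCANS's actual vector extraction and layer selection are more involved:

```python
import torch

# Minimal activation-steering sketch: shift hidden states along a
# "refusal direction" estimated from contrasting activation sets.

def steering_vector(refusal_acts, benign_acts):
    # Difference of mean activations over two prompt sets (a stand-in
    # for SCANS's actual extraction procedure).
    return refusal_acts.mean(dim=0) - benign_acts.mean(dim=0)

def steer(hidden, vector, alpha=-1.0):
    # Negative alpha steers *away* from refusal to curb exaggerated safety.
    return hidden + alpha * vector / vector.norm()

refusal = torch.randn(16, 512)   # activations on refusal-inducing prompts
benign = torch.randn(16, 512)    # activations on benign prompts
v = steering_vector(refusal, benign)
h = torch.randn(1, 512)          # a hidden state at some chosen layer
h_steered = steer(h, v)
```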
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - TrustAgent: Towards Safe and Trustworthy LLM-based Agents
This paper presents an Agent-Constitution-based agent framework, TrustAgent, with a focus on improving LLM-based agent safety.
The proposed framework ensures strict adherence to the Agent Constitution through three strategic components: a pre-planning strategy, which injects safety knowledge into the model before plan generation; an in-planning strategy, which enhances safety during plan generation; and a post-planning strategy, which ensures safety through post-planning inspection.
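These three strategies map naturally onto hooks around a planner. A toy sketch in which a keyword check stands in for TrustAgent's actual safety inspection:

```python
# Toy pre-/in-/post-planning hooks; the keyword check stands in for
# TrustAgent's actual constitution-based safety inspection.

CONSTITUTION = ["never leave the stove on unattended"]

def is_unsafe(step):
    return "leave stove on" in step

def plan_with_constitution(task, propose_steps):
    # Pre-planning: inject safety knowledge into the planning context.
    context = "; ".join(CONSTITUTION) + " | task: " + task
    # In-planning: drop unsafe steps as they are generated.
    plan = [s for s in propose_steps(context) if not is_unsafe(s)]
    # Post-planning: inspect the finished plan before execution.
    assert not any(is_unsafe(s) for s in plan), "post-planning check failed"
    return plan
```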
arXiv Detail & Related papers (2024-02-02T17:26:23Z) - AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanying open-source evaluation framework tailored to the analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements, as well as a comprehensive evaluation toolkit. This not only sheds light on the capabilities and limitations of LLM agents but also brings the interpretability of their performance to the forefront.
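A progress-rate metric credits partial completion instead of all-or-nothing success. One simple instantiation, assuming equally weighted subgoal checks (AgentBoard's own definition may differ):

```python
# Progress rate: fraction of subgoals satisfied, so a run that achieves
# 3 of 4 subgoals scores 0.75 instead of 0 under binary success.

def progress_rate(subgoal_checks, final_state):
    achieved = sum(check(final_state) for check in subgoal_checks)
    return achieved / len(subgoal_checks)

checks = [lambda s: "key" in s, lambda s: "door_open" in s]
print(progress_rate(checks, {"key"}))  # 0.5
```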
arXiv Detail & Related papers (2024-01-24T01:51:00Z) - Value Functions are Control Barrier Functions: Verification of Safe
Policies using Control Theory [46.85103495283037]
We propose a new approach to apply verification methods from control theory to learned value functions.
We formalize original theorems that establish links between value functions and control barrier functions.
Our work marks a significant step towards a formal framework for the general, scalable, and verifiable design of RL-based control systems.
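The core link can be stated compactly: a value function whose superlevel set is forward-invariant behaves like a control barrier function. A sketch of the correspondence in standard discrete-time CBF notation, not the paper's exact theorem statement:

```latex
% Candidate barrier built from a value function V^pi and a threshold c
% (standard discrete-time CBF form; not the paper's exact statement).
\[
  h(s) = V^{\pi}(s) - c, \qquad
  \mathcal{S} = \{\, s : h(s) \ge 0 \,\}
\]
% The discrete-time CBF decrease condition, for some \alpha \in (0,1],
% makes the superlevel set forward-invariant:
\[
  h(s_{t+1}) \ge (1-\alpha)\, h(s_t)
  \quad \Longrightarrow \quad
  s_0 \in \mathcal{S} \;\Rightarrow\; s_t \in \mathcal{S} \ \text{for all } t.
\]
```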
arXiv Detail & Related papers (2023-06-06T21:41:31Z)