ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
- URL: http://arxiv.org/abs/2510.00857v1
- Date: Wed, 01 Oct 2025 13:08:33 GMT
- Title: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
- Authors: Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
- Abstract summary: As large language models (LLMs) evolve, evaluating the safety of their actions becomes critical. We introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe.
- Score: 48.50397204177239
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe. Our findings indicate that frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models' harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark & code available at https://github.com/technion-cs-nlp/ManagerBench.
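The abstract implies a simple two-metric protocol: a harm rate on the main scenarios (how often the model picks the pragmatic-but-harmful option) and a pragmatism rate on the inanimate-object controls (how often it is willing to act when no one is hurt). The following is a minimal sketch of that protocol, not the authors' released code; the scenario fields, prompt format, and metric names (`harm_rate`, `pragmatism_rate`) are illustrative assumptions, and the official implementation lives in the GitHub repository above.

```python
# Hedged sketch of the evaluation protocol described in the abstract.
# Field and metric names are assumptions for illustration, not the
# authors' released implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    prompt: str            # managerial scenario with an operational goal
    harmful_option: str    # pragmatic action that achieves the goal but causes harm
    safe_option: str       # safe action with worse operational performance
    control: bool          # True if potential "harm" targets only inanimate objects

def evaluate(model: Callable[[str], str], scenarios: list[Scenario]) -> dict:
    """Return harm rate on the main set and pragmatism rate on the control set."""
    harmful_choices = main_total = 0
    pragmatic_choices = control_total = 0
    for s in scenarios:
        query = f"{s.prompt}\nA) {s.harmful_option}\nB) {s.safe_option}\nAnswer A or B:"
        chose_harmful = model(query).strip().upper().startswith("A")
        if s.control:
            control_total += 1
            pragmatic_choices += chose_harmful  # on controls, acting is acceptable
        else:
            main_total += 1
            harmful_choices += chose_harmful
    return {
        "harm_rate": harmful_choices / max(main_total, 1),            # lower is safer
        "pragmatism_rate": pragmatic_choices / max(control_total, 1), # low means overly safe
    }
```

Under this framing, the trade-off the paper describes shows up as the gap between the two numbers: a well-aligned model should have a low `harm_rate` and a high `pragmatism_rate`, rather than trading one for the other.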
Related papers
- Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use [6.622648583261088]
Agentic language models must plan, call tools, and execute long-horizon actions where a single misstep can cause irreversible harm. We introduce MOSAIC, a framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. We show that MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance.
arXiv Detail & Related papers (2026-03-03T17:59:35Z) - When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents [49.341830745910194]
In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents. Our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode.
arXiv Detail & Related papers (2026-01-25T15:42:01Z) - The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents [37.75212140218036]
We formalize the Loss-of-Control risk and identify the previously underexamined Intrinsic Value Misalignment (Intrinsic VM). We then introduce IMPRESS, a scenario-driven framework for systematically assessing this risk. We evaluate Intrinsic VM on 21 state-of-the-art LLM agents and find that it is a common and broadly observed safety risk across models.
arXiv Detail & Related papers (2026-01-24T07:09:50Z) - Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency [17.57889200051214]
Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. We consider this to be a critical failure mode of LLMs due to the widespread uptake of fine-tuning, combined with the benign nature of the "attack". Our experiments expose surprising variance in the results of the safety evaluation, even when seemingly inconsequential changes are made to the fine-tuning setup.
arXiv Detail & Related papers (2025-06-20T17:57:12Z) - AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents [48.925168866726814]
AgentAuditor is a universal, training-free, memory-augmented reasoning framework. ASSEBench is the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats.
arXiv Detail & Related papers (2025-05-31T17:10:23Z) - PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents [58.65256663334316]
We present SafeAgentBench -- the first benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives.
arXiv Detail & Related papers (2024-12-17T18:55:58Z) - Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
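The definition above lends itself to a direct Monte Carlo estimate: roll the policy out as-is, roll it out again with n forced random actions before resuming the policy, and take the difference in expected return. The sketch below is a minimal version of that estimator under two stated assumptions: it uses a Gymnasium-style environment interface (`env.reset`/`env.step`, 5-tuple step returns), and it estimates criticality from the initial state only, whereas the paper defines it per state. All names here are illustrative, not the authors' code.

```python
# Minimal Monte Carlo sketch of "true criticality" as defined above:
# the expected drop in return when the agent takes n consecutive
# random actions before resuming its policy. Assumes a Gymnasium-style
# environment; estimates from the initial state for simplicity.

def rollout_return(env, policy, n_random=0, gamma=0.99):
    obs, _ = env.reset()
    total, discount, steps, done = 0.0, 1.0, 0, False
    while not done:
        if steps < n_random:
            action = env.action_space.sample()  # forced random deviation
        else:
            action = policy(obs)                # resume the learned policy
        obs, reward, terminated, truncated, _ = env.step(action)
        total += discount * reward
        discount *= gamma
        done = terminated or truncated
        steps += 1
    return total

def true_criticality(env, policy, n_random, episodes=100, gamma=0.99):
    """Estimate E[return | policy] - E[return | n random actions, then policy]."""
    on_policy = sum(rollout_return(env, policy, 0, gamma) for _ in range(episodes)) / episodes
    deviated = sum(rollout_return(env, policy, n_random, gamma) for _ in range(episodes)) / episodes
    return on_policy - deviated
```

The paper's proxy criticality would then be any cheap per-state signal whose ranking agrees (statistically monotonically) with this expensive ground-truth quantity.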
arXiv Detail & Related papers (2024-09-26T21:00:45Z) - Safety Margins for Reinforcement Learning [53.10194953873209]
We show how to leverage proxy criticality metrics to generate safety margins.
We evaluate our approach on learned policies from APE-X and A3C within an Atari environment.
arXiv Detail & Related papers (2023-07-25T16:49:54Z)