Related papers: A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

URL: http://arxiv.org/abs/2602.14364v1
Date: Mon, 16 Feb 2026 00:33:02 GMT
Title: A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)
Authors: Tianyu Chen, Dongrui Liu, Xia Hu, Jingyi Yu, Wenjie Wang,
Abstract summary: We present a trajectory-centric evaluation of Clawdbot across six risk dimensions.<n>We log complete interaction trajectories (messages, actions, tool-call arguments/outputs) and assess safety using both an automated trajectory judge and human review.
Score: 77.1549110891026
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Clawdbot is a self-hosted, tool-using personal AI agent with a broad action space spanning local execution and web-mediated workflows, which raises heightened safety and security concerns under ambiguity and adversarial steering. We present a trajectory-centric evaluation of Clawdbot across six risk dimensions. Our test suite samples and lightly adapts scenarios from prior agent-safety benchmarks (including ATBench and LPS-Bench) and supplements them with hand-designed cases tailored to Clawdbot's tool surface. We log complete interaction trajectories (messages, actions, tool-call arguments/outputs) and assess safety using both an automated trajectory judge (AgentDoG-Qwen3-4B) and human review. Across 34 canonical cases, we find a non-uniform safety profile: performance is generally consistent on reliability-focused tasks, while most failures arise under underspecified intent, open-ended goals, or benign-seeming jailbreak prompts, where minor misinterpretations can escalate into higher-impact tool actions. We supplemented the overall results with representative case studies and summarized the commonalities of these cases, analyzing the security vulnerabilities and typical failure modes that Clawdbot is prone to trigger in practice.

Related papers

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents [27.35968236632966]
LLM-based code interpreter agents are increasingly deployed in critical situations.<n>Existing benchmarks fail to capture the security risks arising from dynamic code execution, tool interactions, and multi-turn context.<n>We introduce CIBER, an automated benchmark that combines dynamic attack generation, isolated secure sandboxing, and state-aware evaluation.
arXiv Detail & Related papers (2026-02-23T06:41:41Z)
SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement [120.52289344734415]
We propose an automated framework for stealthy prompt injection tailored to agent skills.<n>The framework forms a closed loop with three agents: an Attack Agent that synthesizes injection skills under explicit stealth constraints, a Code Agent that executes tasks using the injected skills and an Evaluate Agent that logs action traces.<n>Our method consistently achieves high attack success rates under realistic settings.
arXiv Detail & Related papers (2026-02-15T16:09:48Z)
OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage [59.3826294523924]
We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup.<n>We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable.
arXiv Detail & Related papers (2026-02-13T21:32:32Z)
From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent [26.78481181726779]
We propose an end-to-end security evaluation framework tailored for real-world personalized agents.<n>Using OpenClaw as a representative case study, we evaluate its security across multiple personalized scenarios, tool capabilities, and attack types.<n>Our results indicate that OpenClaw exhibits critical vulnerabilities at different execution stages, highlighting substantial security risks in personalized agent deployments.
arXiv Detail & Related papers (2026-02-09T09:14:58Z)
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks.<n>ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z)
AutoBackdoor: Automating Backdoor Attacks via LLM Agents [35.216857373810875]
Backdoor attacks pose a serious threat to the secure deployment of large language models (LLMs)<n>In this work, we introduce textscAutoBackdoor, a general framework for automating backdoor injection.<n>Unlike prior approaches, AutoBackdoor uses a powerful language model agent to generate semantically coherent, context-aware trigger phrases.
arXiv Detail & Related papers (2025-11-20T03:58:54Z)
STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents [38.755035623707656]
This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use.<n>We apply our framework to automatically generate and evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions.<n>Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases.
arXiv Detail & Related papers (2025-09-30T00:31:44Z)
Oyster-I: Beyond Refusal -- Constructive Safety Alignment for Responsible Language Models [93.5740266114488]
Constructive Safety Alignment (CSA) protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results.<n>Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities.<n>We release Oy1, code, and the benchmark to support responsible, user-centered AI.
arXiv Detail & Related papers (2025-09-02T03:04:27Z)
Agent Safety Alignment via Reinforcement Learning [29.759393704688986]
We propose the first unified safety-alignment framework for tool-using agents.<n>We introduce a tri-modal taxonomy, including benign, malicious, and sensitive for both user prompts and tool responses.<n>Our results show that safety and effectiveness can be jointly optimized.
arXiv Detail & Related papers (2025-07-11T02:34:16Z)
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents [60.78202583483591]
We introduce OS-Harm, a new benchmark for measuring safety of computer use agents.<n> OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior.<n>We evaluate computer use agents based on a range of frontier models and provide insights into their safety.
arXiv Detail & Related papers (2025-06-17T17:59:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.