Fugu-MT Paper Translation (Abstract): TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Paper overview: TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

arxiv url: http://arxiv.org/abs/2604.07223v1
Date: Wed, 08 Apr 2026 15:46:14 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.61678
Title: TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Title (reference translation): TraceSafe: A Systematic Evaluation of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Authors: Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen
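Abstract summary: Safety guardrails are well-benchmarked for natural language responses, but their efficacy within multi-step tool-use trajectories remains largely unexplored. To address this gap, the authors introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety.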
Reference score (independently computed attention score): 20.868825285848196
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($ρ=0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
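Abstract (reference translation): As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. Safety guardrails are well-benchmarked for natural language responses, but their efficacy within multi-step tool-use trajectories remains largely unexplored. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It covers 12 risk categories, from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), and contains over 1,000 unique execution instances. An evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three key findings. 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than by semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($ρ=0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more than model size does, and general-purpose LLMs consistently outperform specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. As the number of execution steps grows, models can pivot from static tool definitions to dynamic execution behaviors, which actually improves risk detection performance in later stages. The findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment in order to effectively mitigate mid-trajectory risks.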
