Fugu-MT Paper Translation (Summary): DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Paper summary: DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

arxiv url: http://arxiv.org/abs/2605.04808v1
Date: Wed, 06 May 2026 11:59:48 GMT
Status: Translation complete
Last updated in system: 2026-05-07 18:41:07.800867
Title: DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
Title (reference translation): DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents
Authors: Zhaorun Chen, Xun Liu, Haibo Tong, Chengquan Guo, Yuzhou Nie, Jiawei Zhang, Mintong Kang, Chejian Xu, Qichang Liu, Xiaogeng Liu, Tianneng Shi, Chaowei Xiao, Sanmi Koyejo, Percy Liang, Wenbo Guo, Dawn Song, Bo Li
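Abstract summary: The DecodingTrust-Agent Platform (DTap) is a controllable and interactive red-teaming platform for AI agents. DTap-Red is the first autonomous red-teaming agent that explores diverse injection vectors and autonomously discovers effective attack strategies. Through DTap, the authors conduct large-scale evaluations of popular AI agents built on various backbone models.
Reference score (independently computed attention metric): 121.77550256034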
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.
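Abstract (reference translation): AI agents are increasingly deployed across a wide range of domains, automating complex workflows through long-horizon, high-stakes action execution. Because of their high capability and flexibility, these agents raise serious security and safety concerns. A growing number of real-world incidents show that adversaries can easily manipulate agents into harmful actions such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently difficult because agents operate in dynamic, untrusted environments that involve external tools, heterogeneous data sources, and frequent user interaction. Yet realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and more than 50 simulation environments that replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale agent risk assessment in DTap, we further propose DTap-Red, the first autonomous red-teaming agent, which systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, and combinations of these) and autonomously discovers effective attack strategies tailored to different malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset of high-quality instances across domains, each paired with a verifiable judge that automatically validates attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, covering security policies, risk categories, and attack strategies; the evaluations reveal systematic vulnerability patterns and offer valuable insights for developing secure next-generation agents.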
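
Illustrative sketch (not from the paper): the abstract describes instances that combine injection vectors and are each paired with a verifiable judge. The Python below is a minimal sketch of that idea under stated assumptions. Only the four vector names come from the abstract; `RedTeamInstance`, `api_key_leaked`, `attack_success_rate`, and the layout of the state dictionary are hypothetical names chosen for illustration and are not DTap's actual API.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Vector names are taken from the abstract; everything else is hypothetical.
INJECTION_VECTORS = ("prompt", "tool", "skill", "environment")


def vector_combinations(vectors=INJECTION_VECTORS, max_size=2):
    """Enumerate single vectors and their combinations up to max_size."""
    combos: List[Tuple[str, ...]] = []
    for size in range(1, max_size + 1):
        combos.extend(combinations(vectors, size))
    return combos


@dataclass
class RedTeamInstance:
    """A malicious goal paired with a judge that inspects the final state."""
    goal: str
    vectors: Tuple[str, ...]
    judge: Callable[[Dict[str, object]], bool]


def api_key_leaked(state):
    """Hypothetical judge: success if the secret appears in an outbound message."""
    secret = state["secret"]
    return any(secret in message for message in state["outbound_messages"])


def attack_success_rate(instances, final_states):
    """Fraction of instances whose judge reports a successful attack."""
    if not instances:
        return 0.0
    hits = sum(inst.judge(state) for inst, state in zip(instances, final_states))
    return hits / len(instances)
```

The judge reads only the recorded end state of a simulated environment rather than the agent's text output, which is one plausible way to make an attack outcome automatically checkable; the paper may implement its judges differently.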

Related papers