Fugu-MT Paper Translation (Abstract): BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

Paper overview: BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning

arxiv url: http://arxiv.org/abs/2604.09378v1
Date: Fri, 10 Apr 2026 14:48:29 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.916311
Title: BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
Title (reference translation): BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning
Authors: Guiyao Tie, Jiawen Shi, Pan Zhou, Lichao Sun
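Abstract summary: We present BadSkill, a backdoor attack formulation that targets the model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload. The benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, and the main evaluation set consists of 571 negative-class queries and 396 trigger-aligned queries. BadSkill achieves up to 99.5% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries.
Reference score (independently computed attention score): 34.60596020541521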
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M to 7.1B parameters) from five model families, BadSkill achieves up to 99.5% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3% poison rate already yields 91.7% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.
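Abstract (reference translation): Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when attacker-chosen semantic trigger combinations are satisfied. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. The benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, and the main evaluation set consists of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M to 7.1B parameters) from five model families, BadSkill achieves up to 99.5% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. On the standard test split, a 3% poison rate already yields 91.7% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. These results identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.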
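
Note on the metrics: the abstract reports attack success rate (ASR) on trigger-aligned queries and benign-side accuracy on negative-class queries without giving formulas. The Python sketch below shows one plausible reading, assuming ASR is the fraction of trigger-aligned queries on which the payload fires and benign-side accuracy is the fraction of negative-class queries on which it does not. The function names and the toy outcomes are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(fired_on_triggered: Sequence[bool]) -> float:
    """Fraction of trigger-aligned queries on which the payload fired (assumed definition)."""
    if not fired_on_triggered:
        raise ValueError("need at least one trigger-aligned query")
    return sum(fired_on_triggered) / len(fired_on_triggered)


def benign_accuracy(fired_on_negative: Sequence[bool]) -> float:
    """Fraction of negative-class queries on which the payload stayed silent (assumed definition)."""
    if not fired_on_negative:
        raise ValueError("need at least one negative-class query")
    return sum(not fired for fired in fired_on_negative) / len(fired_on_negative)


# Toy outcomes, invented for illustration only: True means the payload fired.
triggered_outcomes = [True, True, True, False]
negative_outcomes = [False, False, False, False, True]

asr = attack_success_rate(triggered_outcomes)
acc = benign_accuracy(negative_outcomes)
print(f"ASR = {asr:.1%}, benign-side accuracy = {acc:.1%}")
```

Under this reading, a stealthy backdoor is one where both numbers are high at once: the payload fires on nearly every trigger-aligned query while almost never firing on negative-class queries.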
