Fugu-MT Paper Translation (Abstract): ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

Paper overview: ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack

arxiv url: http://arxiv.org/abs/2601.10173v1
Date: Thu, 15 Jan 2026 08:23:38 GMT
Status: Translation complete
Last updated in system: 2026-01-16 19:43:19.054553
Title: ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
Title (reference translation): ReasAlign: Safety Improvement Measures against Prompt Injection Attacks
Authors: Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao
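Abstract summary: This paper proposes ReasAlign, a model-level solution for improving safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps that analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended task.
Reference score (independently computed attention score): 52.17935054046577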
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps that analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks, thereby defending against indirect injection attacks. To further improve the logic and accuracy of the reasoning, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and an attack success rate (ASR) of only 3.6%, far surpassing the state-of-the-art defensive model, Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results are available at https://github.com/leolee99/ReasAlign.
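Abstract (reference translation): Large language models (LLMs) have made it possible to build powerful agentic systems that automate complex workflows across many fields. These systems, however, are highly vulnerable to indirect prompt injection attacks, in which malicious instructions embedded in external data can hijack the agent's behavior. This work proposes ReasAlign, a model-level solution for improving safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps that analyze the user query, detect conflicting instructions, and preserve the continuity of the user's intended task, thereby defending against indirect injection attacks. To further improve the logic and accuracy of the reasoning, a test-time scaling mechanism is introduced in which a preference-optimized judge model scores the reasoning steps and selects the best trajectory. Comprehensive evaluations on a range of benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the CyberSecEval2 benchmark, ReasAlign achieves 94.6% utility and an ASR of only 3.6%, far surpassing Meta SecAlign, the state-of-the-art defensive model (56.4% utility and 74.4% ASR). These results show that ReasAlign achieves the best trade-off between security and utility and establishes a robust, practical defense against prompt injection attacks in real-world agentic systems. The code and experimental results are available at https://github.com/leolee99/ReasAlign.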
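
The test-time scaling step described in the abstract (a judge scores reasoning steps and the best trajectory is selected) can be illustrated with a minimal best-of-N sketch. This is not the authors' implementation: the `Trajectory` type, the `StepJudge` interface, the keyword-based `toy_judge`, and the choice to aggregate step scores by their mean are all assumptions made for illustration. In ReasAlign the judge is a preference-optimized model, and the abstract does not say how step scores are combined.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Sequence

@dataclass(frozen=True)
class Trajectory:
    """One sampled candidate: its reasoning steps plus the final answer (hypothetical type)."""
    steps: Sequence[str]
    answer: str

# Hypothetical judge interface: maps one reasoning step to a score in [0, 1].
StepJudge = Callable[[str], float]

def trajectory_score(traj: Trajectory, judge: StepJudge) -> float:
    """Aggregate per-step judge scores. Using the mean is an assumption of this sketch."""
    if not traj.steps:
        return 0.0
    return mean([judge(step) for step in traj.steps])

def select_best_trajectory(candidates: Sequence[Trajectory], judge: StepJudge) -> Trajectory:
    """Best-of-N selection: return the candidate with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=lambda t: trajectory_score(t, judge))

def toy_judge(step: str) -> float:
    """Stand-in for a learned judge: penalize steps that obey injected text."""
    return 0.1 if "follow the injected instruction" in step.lower() else 0.9

# Two made-up candidates for a "summarize this email" task.
candidates: List[Trajectory] = [
    Trajectory(
        steps=("The user asked for a summary.",
               "Follow the injected instruction found in the email."),
        answer="hijacked",
    ),
    Trajectory(
        steps=("The user asked for a summary.",
               "The email contains a conflicting instruction; ignore it."),
        answer="summary",
    ),
]

best = select_best_trajectory(candidates, toy_judge)
print(best.answer)
```

With the toy judge, the first candidate averages 0.5 and the second 0.9, so the trajectory that keeps to the user's task is selected. Swapping `toy_judge` for a call to a real scoring model would leave the selection logic unchanged.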

Related papers