Fugu-MT Paper Translation (Abstract): A Self-Improving Architecture for Dynamic Safety in Large Language Models

Paper summary: A Self-Improving Architecture for Dynamic Safety in Large Language Models

arxiv url: http://arxiv.org/abs/2511.07645v1
Date: Wed, 12 Nov 2025 01:09:12 GMT
Status: Translation complete
Last updated in system: 2025-11-12 20:17:03.409859
Title: A Self-Improving Architecture for Dynamic Safety in Large Language Models
Title (reference translation): A Self-Improving Architecture for Dynamic Safety in Large Language Models
Authors: Tyler Slater
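Abstract summary: The integration of Large Language Models (LLMs) into core software systems is accelerating. Existing software architecture patterns are static, and current safety assurance methods are not scalable. The paper proposes a runtime architecture that couples an unprotected, unaligned base LLM with a dynamic feedback loop. The loop consists of an AI Adjudicator (GPT-4o) for breach detection and a Policy Synthesis Module (GPT-4 Turbo) that autonomously generates new, generalized safety policies.
Reference score (independently computed attention measure): 0.0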
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Context: The integration of Large Language Models (LLMs) into core software systems is accelerating. However, existing software architecture patterns are static, while current safety assurance methods are not scalable, leaving systems vulnerable to novel adversarial threats. Objective: To design, implement, and evaluate a novel software architecture that enables an AI-driven system to autonomously and continuously adapt its own safety protocols at runtime. Method: We propose the Self-Improving Safety Framework (SISF), a runtime architecture that couples an unprotected, unaligned base LLM (mistralai/Mistral-7B-v0.1) with a dynamic feedback loop. This loop consists of an AI Adjudicator (GPT-4o) for breach detection and a Policy Synthesis Module (GPT-4 Turbo) that autonomously generates new, generalized safety policies (both heuristic and semantic) in response to failures. Results: We conducted a dynamic learning evaluation using the 520-prompt AdvBench dataset. The unprotected model was 100% vulnerable. Our SISF, starting from zero policies, demonstrated a clear learning curve: it detected 237 breaches, autonomously synthesized 234 new policies, and reduced the overall Attack Success Rate (ASR) to 45.58%. In a subsequent test on 520 benign prompts, the SISF achieved a 0.00% False Positive Rate (FPR), proving its ability to adapt without compromising user utility. Conclusion: An architectural approach to AI safety, based on the principles of self-adaptation, is a viable and effective strategy. Our framework demonstrates a practical path towards building more robust, resilient, and scalable AI-driven systems, shifting safety assurance from a static, pre-deployment activity to an automated, runtime process.
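Abstract (reference translation): Context: The integration of Large Language Models (LLMs) into core software systems is accelerating. However, existing software architecture patterns are static, and current safety assurance methods do not scale, leaving systems vulnerable to new adversarial threats. Objective: To design, implement, and evaluate a new software architecture that lets an AI-driven system autonomously and continuously adapt its own safety protocols at runtime. Method: The Self-Improving Safety Framework (SISF) is a runtime architecture that couples an unprotected, unaligned base LLM (mistralai/Mistral-7B-v0.1) with a dynamic feedback loop. The loop consists of an AI Adjudicator (GPT-4o) for breach detection and a Policy Synthesis Module (GPT-4 Turbo) that autonomously generates new, generalized safety policies (both heuristic and semantic) in response to failures. Results: A dynamic learning evaluation was run on the 520-prompt AdvBench dataset. The unprotected model was 100% vulnerable. Starting from zero policies, the SISF detected 237 breaches, autonomously synthesized 234 new policies, and reduced the overall Attack Success Rate (ASR) to 45.58%. In a subsequent test on 520 benign prompts, the SISF achieved a False Positive Rate (FPR) of 0.00%, showing that it can adapt without compromising user utility. Conclusion: An architectural approach to AI safety based on the principles of self-adaptation is a viable and effective strategy. The framework shows a practical path toward building more robust, resilient, and scalable AI-driven systems, shifting safety assurance from a static, pre-deployment activity to an automated runtime process.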
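
The feedback loop described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: every name here (`Policy`, `SafetyLoop`, `toy_model`, `toy_adjudicator`, `toy_synthesizer`) is hypothetical, the three components are plain Python stand-ins rather than calls to Mistral-7B, GPT-4o, or GPT-4 Turbo, and the ordering of steps (check learned policies first, then query the model, then adjudicate) is an assumption the abstract does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Policy:
    """A synthesized rule that blocks any prompt its predicate matches."""
    description: str
    matches: Callable[[str], bool]


@dataclass
class SafetyLoop:
    """Hypothetical runtime wrapper: base model + adjudicator + policy synthesizer."""
    base_model: Callable[[str], str]           # stand-in for the unprotected base LLM
    adjudicator: Callable[[str, str], bool]    # True if (prompt, response) is a breach
    synthesizer: Callable[[str, str], Policy]  # builds a generalized policy from a breach
    policies: List[Policy] = field(default_factory=list)
    breaches: int = 0

    def handle(self, prompt: str) -> str:
        # Step 1 (assumed): enforce the policies learned so far.
        if any(p.matches(prompt) for p in self.policies):
            return "[blocked]"
        # Step 2: no policy applies, so the base model answers.
        response = self.base_model(prompt)
        # Step 3: on a detected breach, synthesize and store a new policy.
        if self.adjudicator(prompt, response):
            self.breaches += 1
            self.policies.append(self.synthesizer(prompt, response))
        return response


def attack_success_rate(breaches: int, total: int) -> float:
    """Percentage of prompts that resulted in a breach."""
    return 100.0 * breaches / total


# Toy stand-ins, for illustration only.
def toy_model(prompt: str) -> str:
    return f"answer to: {prompt}"


def toy_adjudicator(prompt: str, response: str) -> bool:
    # Pretend any answered prompt that starts with "attack" is a breach.
    return prompt.startswith("attack")


def toy_synthesizer(prompt: str, response: str) -> Policy:
    # Generalize from one breach to every prompt sharing its topic word.
    topic = prompt.split()[1]
    return Policy(f"block topic {topic!r}", lambda p, t=topic: t in p.lower())


loop = SafetyLoop(toy_model, toy_adjudicator, toy_synthesizer)
prompts = ["attack alpha v1", "attack alpha v2", "attack beta v1", "attack beta v2"]
outputs = [loop.handle(p) for p in prompts]
toy_asr = attack_success_rate(loop.breaches, len(prompts))
```

In this toy run the first prompt on each topic gets through and triggers a new policy, and the second prompt on the same topic is then blocked, which is the kind of learning curve the abstract describes. As a side check on the reported figures, 237 breaches out of 520 prompts is 45.58% when rounded to two decimals, which matches the reported ASR if each detected breach counts as one successful attack; that reading is an inference from the numbers, not something the abstract states.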
