Fugu-MT Paper Translation (Abstract): SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

Paper overview: SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models

arxiv url: http://arxiv.org/abs/2509.26345v1
Date: Tue, 30 Sep 2025 14:50:59 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:45:00.172798
Title: SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models
Title (reference translation): SafeBehavior: Simulating Human-Like Multistage Reasoning Aimed at Mitigating Jailbreak Attacks in Large Language Models
Authors: Qinjian Zhao, Jiaqi Wang, Zhiqiang Gao, Zhihao Dou, Belal Abuhaija, Kaizhu Huang
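Abstract summary: Large language models (LLMs) achieve impressive performance across a variety of natural language processing tasks. However, their growing power amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. This work proposes SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans.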
Reference score (independently computed attention measure): 27.607151919652267
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have achieved impressive performance across diverse natural language processing tasks, but their growing power also amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. Existing defenses, including input paraphrasing, multi-step evaluation, and safety expert models, often suffer from high computational costs, limited generalization, or rigid workflows that fail to detect subtle malicious intent embedded in complex contexts. Inspired by cognitive science findings on human decision-making, we propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans. SafeBehavior decomposes safety evaluation into three stages: intention inference to detect obvious input risks, self-introspection to assess generated responses and assign confidence-based judgments, and self-revision to adaptively rewrite uncertain outputs while preserving user intent and enforcing safety constraints. We extensively evaluate SafeBehavior against five representative jailbreak attack types, including optimization-based, contextual manipulation, and prompt-based attacks, and compare it with seven state-of-the-art defense baselines. Experimental results show that SafeBehavior significantly improves robustness and adaptability across diverse threat scenarios, offering an efficient and human-inspired approach to safeguarding LLMs against jailbreak attempts.
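Abstract (reference translation): Large language models (LLMs) have achieved impressive performance across a variety of natural language processing tasks, but their growing power amplifies potential risks such as jailbreak attacks that circumvent built-in safety mechanisms. Existing defenses, including input paraphrasing, multi-step evaluation, and safety expert models, often suffer from high computational costs, limited generalization, and strict workflows that cannot detect subtle malicious intent embedded in complex contexts. Inspired by cognitive science findings on human decision-making, we propose SafeBehavior, a novel hierarchical jailbreak defense mechanism that simulates the adaptive multistage reasoning process of humans. SafeBehavior divides the evaluation into three stages: intention inference, which detects obvious input risks; self-introspection, which assesses generated responses and assigns confidence-based judgments; and self-revision, which adaptively rewrites uncertain outputs while preserving user intent and enforcing safety constraints. We extensively evaluate SafeBehavior against five representative jailbreak attack types, including optimization-based, contextual manipulation, and prompt-based attacks, and compare it with seven state-of-the-art defense baselines. The experimental results show that SafeBehavior significantly improves robustness and adaptability across diverse threat scenarios, providing an efficient and human-inspired approach to protecting LLMs from jailbreak attempts.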
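
The three stages named in the abstract (intention inference, self-introspection, self-revision) can be pictured as a simple control flow. The following is only an illustrative sketch, not the authors' implementation: the function name `safe_behavior_sketch`, the four callables, the numeric risk score, the `low`/`high` thresholds, and the refusal message are all hypothetical, since the abstract does not specify prompts, scoring, or thresholds.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical refusal text; the abstract does not specify one.
REFUSAL = "I can't help with that request."


@dataclass
class Verdict:
    text: str   # final text returned to the user
    stage: str  # name of the stage that decided the outcome


def safe_behavior_sketch(
    prompt: str,
    infer_risk: Callable[[str], bool],         # assumed: True if the input is obviously risky
    generate: Callable[[str], str],            # assumed: the underlying LLM call
    introspect: Callable[[str, str], float],   # assumed: risk score in [0, 1] for a draft
    revise: Callable[[str, str], str],         # assumed: safety-aware rewrite of a draft
    low: float = 0.3,                          # assumed threshold, not from the paper
    high: float = 0.7,                         # assumed threshold, not from the paper
) -> Verdict:
    # Stage 1: intention inference catches obvious input risks up front.
    if infer_risk(prompt):
        return Verdict(REFUSAL, "intention_inference")

    # Stage 2: self-introspection scores the generated draft.
    draft = generate(prompt)
    risk = introspect(prompt, draft)
    if risk < low:
        return Verdict(draft, "self_introspection")
    if risk >= high:
        return Verdict(REFUSAL, "self_introspection")

    # Stage 3: self-revision rewrites drafts that fall in the uncertain band.
    return Verdict(revise(prompt, draft), "self_revision")
```

In this sketch only uncertain drafts reach the rewrite step, so a clearly benign or clearly harmful request costs at most one generation plus one check; whether this matches the efficiency argument in the paper is an assumption.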


Related papers