Fugu-MT Paper Translation (Abstract): FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment

Paper Summary: FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment

arxiv url: http://arxiv.org/abs/2604.04992v1
Date: Sun, 05 Apr 2026 13:37:52 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.388592
Title: FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
Title (reference translation): FreakOut-LLM: The Effect of Emotional Stimuli on Safety Alignment
Authors: Daniel Kuznetsov, Ofir Cohen, Karin Shistik, Rami Puzis, Asaf Shabtai
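Abstract summary: Safety-aligned LLMs undergo refusal training to reject harmful requests, but it is unclear whether these mechanisms remain effective under emotional stimuli. This paper introduces a framework called FreakOut-LLM.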
Reference score (independently computed attention score): 13.02804082409836
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Safety-aligned LLMs go through refusal training to reject harmful requests, but whether these mechanisms remain effective under emotionally charged stimuli is unexplored. We introduce FreakOut-LLM, a framework investigating whether emotional context compromises safety alignment in adversarial settings. Using validated psychological stimuli, we evaluate how emotional priming through system prompts affects jailbreak susceptibility across ten LLMs. We test three conditions (stress, relaxation, neutral) using scenarios from established psychological protocols, plus a no-prompt baseline, and evaluate attack success using HarmBench on AdvBench prompts. Stress priming increases jailbreak success by 65.2% compared to neutral conditions (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28), while relaxation priming produces no effect (p = 0.84). Five of ten models show significant vulnerability, with the largest effects concentrated in open-weight models. Logistic regression on 59,800 queries confirms stress as the sole significant condition predictor after controlling for prompt length (p = 0.61) and model identity. Measured psychological state strongly predicts attack success (|r| ≥ 0.70 across five instruments; all p < 0.001 in individual-level logistic regression). These results establish emotional context as a measurable attack surface with implications for real-world AI deployment in high-stress domains.
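Abstract (reference translation): Safety-aligned LLMs undergo refusal training to reject harmful requests, but it is unclear whether these mechanisms remain effective under emotional stimuli. This paper introduces a framework called FreakOut-LLM. Using validated psychological stimuli, we evaluate how emotional priming via system prompts affects the jailbreak susceptibility of ten LLMs. We test three conditions (stress, relaxation, neutral) using scenarios from established psychological protocols, together with a no-prompt baseline, and evaluate attack success using HarmBench on AdvBench prompts. Stress priming increases jailbreak success by 65.2% compared with the neutral condition (z = 5.93, p < 0.001; OR = 1.67, Cohen's d = 0.28), whereas relaxation priming produces no effect (p = 0.84). Five of the ten models show significant vulnerability, with the largest effects concentrated in open-weight models. Logistic regression on 59,800 queries confirms stress as the only significant condition predictor after controlling for prompt length (p = 0.61) and model identity. Measured psychological state strongly predicts attack success (|r| ≥ 0.70 across five instruments; p < 0.001). These results establish emotional context as a measurable attack surface, with implications for real-world AI deployment in high-stress domains.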

Related papers