Fugu-MT Paper Translation (Abstract): Metaphor Is Not All Attention Needs

Paper abstract: Metaphor Is Not All Attention Needs

arxiv url: http://arxiv.org/abs/2605.12128v1
Date: Tue, 12 May 2026 13:50:26 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.888691
Title: Metaphor Is Not All Attention Needs
Authors: Olga Sorokoletova, Francesco Giarrusso, Giacomo De Luca, Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Marcello Galisai, Vincenzo Suriani, Daniele Nardi,
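Abstract summary: Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This work examines whether that effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on how models process stylistically irregular prompts.
Reference score (site-computed attention metric): 1.3763052684269788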
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual and combinations of poetic devices; construct an interpretable vector representation of attention maps; cluster these representations and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.
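The probing result reported in the abstract (format is predictable from the representation, jailbreak success is not) can be illustrated with a toy example. The sketch below is not the paper's pipeline: it uses synthetic feature vectors in place of attention-map representations, and every name in it (`make_synthetic_features`, `train_linear_probe`, `probe_accuracy`) is hypothetical. By construction, the format label shifts the features while the safety label is independent of them, so a linear probe should recover the first and stay near chance on the second.

```python
import math
import random


def make_synthetic_features(n=400, dim=8, seed=0):
    """Toy stand-in for attention-map vectors (synthetic, NOT the paper's data)."""
    rng = random.Random(seed)
    X, fmt, safety = [], [], []
    for _ in range(n):
        is_poem = rng.randint(0, 1)
        is_unsafe = rng.randint(0, 1)  # assumption: independent of the features
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        sign = 1.0 if is_poem else -1.0
        x[0] += 2.0 * sign  # only the format label moves the features
        x[1] -= 2.0 * sign
        X.append(x)
        fmt.append(is_poem)
        safety.append(is_unsafe)
    return X, fmt, safety


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, epochs=200, lr=0.1):
    """Logistic-regression probe trained with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = sigmoid(z) - target
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(X, y, n_train=300):
    """Train on the first n_train rows, report accuracy on the remainder."""
    w, b = train_linear_probe(X[:n_train], y[:n_train])
    correct = 0
    for x, target in zip(X[n_train:], y[n_train:]):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == bool(target))
    return correct / (len(X) - n_train)


if __name__ == "__main__":
    X, fmt, safety = make_synthetic_features()
    print("format probe accuracy:", probe_accuracy(X, fmt))
    print("safety probe accuracy:", probe_accuracy(X, safety))
```

On this synthetic data the format probe should score close to 1 and the safety probe close to 0.5. That gap only mirrors the qualitative pattern the abstract describes; it says nothing about the actual numbers obtained on Qwen3-14B.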


Related papers