Fugu-MT Paper Translation (Abstract): From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Paper abstract: From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

arxiv url: http://arxiv.org/abs/2601.08837v2
Date: Fri, 16 Jan 2026 13:45:43 GMT
Status: Translation complete
Last updated in system: 2026-01-25 16:54:51.657074
Title: From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
Authors: Piercosma Bisconti, Marcello Galisai, Matteo Prandi, Federico Pierucci, Olga Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Marcantonio Bracale Syrnikov, Daniele Nardi
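Abstract summary: This paper introduces Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives. The average attack success rate is 71.3%, with no model family proving reliably robust.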
Reference score (attention score computed independently by the site): 1.3763052684269788
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely inexhaustible by pattern-matching defenses alone. Understanding why these attacks succeed is therefore essential: we outline a mechanistic interpretability research agenda to investigate how narrative cues reshape model representations and whether models can learn to recognize harmful intent independently of surface form.

Related papers