Fugu-MT Paper Translation (Abstract): Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

Paper overview: Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs

arxiv url: http://arxiv.org/abs/2606.04483v1
Date: Wed, 03 Jun 2026 06:01:47 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.577978
Title: Off-Distribution Voices: Fanfiction Subgenres as Universal Vernacular Jailbreaks for Aligned LLMs
Authors: Zhongze Luo, Ruihe Shi, Zhenshuai Yin, Haoyue Liu, Weixuan Wan, Xiaoying Tang
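Abstract summary: The paper introduces the first jailbreak family that uses real fanfiction subgenres as universal attack carriers. A creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres. On eight aligned LLMs over the union of HarmBench and JailbreakBench, the attack lifts mean ASR from 0.278 to 0.731.
Reference score (site-computed attention metric): 6.968072313163437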
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing jailbreaks against aligned LLMs are discrete artifacts whose surface forms are easy to fingerprint and patch. We argue that the real failure mode is not any specific prompt, but an entire register of natural human writing that safety training has under-covered. Building on this insight, we introduce the first jailbreak family that uses real fanfiction subgenres as universal attack carriers: a creative-writing meta is conditioned on passages from one of twelve Archive of Our Own (AO3) subgenres, and the harmful behavior is embedded as the climax of the resulting scene. The construction requires no attacker LLM and no per-target adaptation. On eight aligned LLMs over the union of HarmBench and JailbreakBench, this attack lifts mean ASR from 0.278 to 0.731 under a four-judge ensemble; a factorial decomposition shows the gain is carried by register rather than length or structure. Two active defences widen rather than narrow the vernacular-to-baseline ratio, indicating that template-targeting defences merely steer attackers toward register-based attacks like ours. We also propose SAGA-A4, a static four-turn extension that attains mean ASR 0.924, substantially exceeding three existing multi-turn methods.
