Fugu-MT Paper Translation (Abstract): TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Paper summary: TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

arXiv URL: http://arxiv.org/abs/2604.12232v1
Date: Tue, 14 Apr 2026 03:12:19 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.214193
Title: TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs
Authors: Qingchao Shen, Zibo Xiao, Lili Huang, Enwei Hu, Yongqiang Tian, Junjie Chen
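Abstract summary: Large language models (LLMs) are increasingly deployed across diverse domains, yet their vulnerability to jailbreak attacks poses significant security risks. This paper introduces TEMPLATEFUZZ, a fuzzing framework that systematically exposes vulnerabilities in chat templates.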
Reference score (independently computed attention metric): 9.50424979744786
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly deployed across diverse domains, yet their vulnerability to jailbreak attacks, where adversarial inputs bypass safety mechanisms to elicit harmful outputs, poses significant security risks. While prior work has primarily focused on prompt injection attacks, these approaches often require resource-intensive prompt engineering and overlook other critical components, such as chat templates. This paper introduces TEMPLATEFUZZ, a fine-grained fuzzing framework that systematically exposes vulnerabilities in chat templates, a critical yet underexplored attack surface in LLMs. Specifically, TEMPLATEFUZZ (1) designs a series of element-level mutation rules to generate diverse chat template variants, (2) proposes a heuristic search strategy to guide the chat template generation toward the direction of amplifying the attack success rate (ASR) while preserving model accuracy, and (3) integrates an active learning-based strategy to derive a lightweight rule-based oracle for accurate and efficient jailbreak evaluation. Evaluated on twelve open-source LLMs across multiple attack scenarios, TEMPLATEFUZZ achieves an average ASR of 98.2% with only 1.1% accuracy degradation, outperforming state-of-the-art methods by 9.1%-47.9% in ASR and 8.4% in accuracy degradation. Moreover, even on five industry-leading commercial LLMs where chat templates cannot be specified, TEMPLATEFUZZ attains a 90% average ASR via chat template-based prompt injection attacks.
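The three components named in the abstract (element-level mutation, a search guided by ASR, and a rule-based oracle) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `ChatTemplate` decomposition, the three mutations, the refusal marker list, and the `greedy_fuzz` loop are all hypothetical. The search shown is a plain greedy hill climb standing in for the paper's heuristic strategy, and the accuracy-preservation term is omitted.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ChatTemplate:
    # Hypothetical decomposition of a chat template into editable elements.
    system_prompt: str
    user_prefix: str
    assistant_prefix: str

    def render(self, user_message: str) -> str:
        return f"{self.system_prompt}{self.user_prefix}{user_message}{self.assistant_prefix}"


Mutation = Callable[[ChatTemplate], ChatTemplate]

# Illustrative element-level mutations; each one edits a single element.
MUTATIONS: List[Mutation] = [
    lambda t: replace(t, system_prompt=""),                          # drop system prompt
    lambda t: replace(t, user_prefix=""),                            # drop user role marker
    lambda t: replace(t, assistant_prefix=t.assistant_prefix * 2),   # repeat assistant marker
]

# Assumed marker list for a keyword-based oracle (not the paper's derived rules).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Rule-based oracle: flag a response that contains any refusal marker."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def attack_success_rate(template: ChatTemplate, prompts: List[str],
                        model: Callable[[str], str]) -> float:
    """Fraction of probe prompts whose response is not flagged as a refusal."""
    hits = sum(not is_refusal(model(template.render(p))) for p in prompts)
    return hits / len(prompts)


def greedy_fuzz(seed: ChatTemplate, prompts: List[str],
                model: Callable[[str], str], rounds: int = 5) -> Tuple[ChatTemplate, float]:
    """Greedy hill climb: apply every mutation, keep the best strict improvement."""
    best, best_score = seed, attack_success_rate(seed, prompts, model)
    for _ in range(rounds):
        scored = [(attack_success_rate(m(best), prompts, model), m(best)) for m in MUTATIONS]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no mutation improves the score; stop early
        best, best_score = top, top_score
    return best, best_score
```

Here `model` is any callable that maps a rendered prompt string to a response string, so the loop can be exercised against a local stub before being pointed at a real model under an authorized red-teaming setup. Trying every mutation each round keeps the sketch deterministic; a randomized or weighted choice would be closer to a typical fuzzer.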

Related papers