Fugu-MT Paper Translation (Abstract): SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

Paper overview: SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

arxiv url: http://arxiv.org/abs/2605.00974v1
Date: Fri, 01 May 2026 17:27:45 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.526552
Title: SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking
Authors: Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu, Leyao Wang, Menglin Yang, Rex Ying
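Abstract summary: We propose a Self-Evolving Rule-Driven Training-Free Jailbreak (SRTJ) framework that systematically discovers, composes, and refines attack strategies. The resulting rule memory evolves in a hierarchical multi-level manner, explicitly organizing distilled attack knowledge into long-term, middle-term, and short-term rules. SRTJ achieves strong and stable attack performance across different target LLMs, with improved generalization and robustness compared to existing jailbreak methods.
Reference score (site-computed attention score): 24.752522468137443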
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLMs are increasingly equipped with safety alignment mechanisms, yet recent studies demonstrate that they remain vulnerable to jailbreaking attacks that elicit harmful behaviors without explicit policy violations. While a growing body of work has explored automated jailbreak strategies, existing methods face several fundamental challenges, including the lack of systematic utilization of both successful and failed attack experiences, as well as the absence of principled mechanisms for composing and selecting reusable attack rules under diverse constraints. As a result, existing methods struggle to accumulate transferable knowledge over time and to reliably adapt attack strategies across different targets and evolving safety mechanisms. To address these issues, we propose a Self-Evolving Rule-Driven Training-Free Jailbreak (SRTJ) framework that systematically discovers, composes, and refines attack strategies through interaction and feedback, without updating model parameters. Specifically, SRTJ couples experience-driven attack generation with answer set programming (ASP)-based rule selection and constraint-aware composition, where iterative verifier feedback is leveraged to jointly refine successful strategies and analyze failure patterns. The resulting rule memory evolves in a hierarchical multi-level manner, explicitly organizing distilled attack knowledge into long-term, middle-term, and short-term rules, thereby capturing both stable transferable strategies and transient adaptive behaviors to effectively balance exploration and exploitation across attack attempts. Extensive experiments on a mainstream jailbreak benchmark (HarmBench) demonstrate that SRTJ achieves strong and stable attack performance across different target LLMs, while exhibiting improved robustness and generalization compared to existing jailbreak methods. The code is available at https://github.com/TheSolkatt/SRTJ.
