Fugu-MT paper translation (abstract): Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

Paper overview: Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges

arxiv url: http://arxiv.org/abs/2511.01375v1
Date: Mon, 03 Nov 2025 09:18:27 GMT
Status: Translation complete
System update date: 2025-11-05 16:37:27.197169
Title: Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Authors: Hamin Koo, Minseon Kim, Jaehyung Kim
Abstract summary: We introduce AMIS, a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates. AMIS achieves state-of-the-art performance, with 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet.
Reference score (site-computed attention measure): 10.382464507264784
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty in the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined using fine-grained and dense feedback using a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku and 100.0% ASR on Claude-4-Sonnet, outperforming existing baselines by substantial margins.
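The contrast the abstract draws between a sparse binary ASR signal and a dense judge score can be illustrated with a small metric calculation. The sketch below is not the paper's method: the abstract does not define the ASR alignment score, so `alignment_score` is a hypothetical stand-in that measures how often a thresholded judge score agrees with the binary outcome. All data, the threshold, and the function names are made up for illustration.

```python
from statistics import mean


def attack_success_rate(outcomes):
    """Fraction of queries whose binary outcome is 1."""
    return mean(outcomes)


def alignment_score(scores, outcomes, threshold):
    """Hypothetical agreement measure (an assumption, not from the paper):
    share of queries where thresholding the dense judge score
    reproduces the binary outcome."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    return mean(int(p == o) for p, o in zip(predicted, outcomes))


# Made-up illustrative data: binary outcomes for five queries and the
# dense scores two different scoring templates assigned to them.
outcomes = [1, 0, 0, 1, 0]
scores_a = [9.0, 2.0, 3.0, 8.0, 1.0]  # template A tracks the outcomes
scores_b = [9.0, 7.0, 3.0, 4.0, 8.0]  # template B often disagrees

asr = attack_success_rate(outcomes)
align_a = alignment_score(scores_a, outcomes, threshold=5.0)
align_b = alignment_score(scores_b, outcomes, threshold=5.0)

print(f"ASR: {asr:.2f}")
print(f"alignment, template A: {align_a:.2f}")
print(f"alignment, template B: {align_b:.2f}")
```

Under these assumptions, a template whose scores agree with the observed outcomes more often would be preferred in an outer loop, which is the sense in which the abstract describes templates as becoming more calibrated.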

