Fugu-MT Paper Translation (Abstract): How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

Paper overview: How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

arxiv url: http://arxiv.org/abs/2605.06605v1
Date: Thu, 07 May 2026 17:25:15 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:12.038171
Title: How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
Title (reference translation): How Many Jailbreak Iterations Are There? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
Authors: Shai Feldman, Yaniv Romano
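Abstract summary: In multi-turn conversational settings with large language models (LLMs), key events often emerge only after repeated interactions. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations needed to trigger the event of interest. DAPRO is a theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn interactions.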
Reference score (independently computed attention level): 22.523809021772802
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repeated interactions. These events might be rare, and under any feasible computational budget, remain unobserved. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations to trigger the event of interest, but rely on static budget allocation that is inefficient in multi-turn setups. To address this, we introduce Dynamic Allocation via PRojected Optimization (DAPRO), the first theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn LLM interactions. We prove that DAPRO satisfies the budget constraint and provides distribution-free, finite-sample coverage guarantees without requiring the conditional independence between censoring and event times assumed by prior conformal survival approaches. A key theoretical contribution is a novel coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight, yielding provably tighter guarantees than prior work. Furthermore, DAPRO can be employed to obtain unbiased, low-variance estimates of population-level evaluation metrics, such as the jailbreak rate, under limited computing resources. Comprehensive experiments across agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations using LLMs such as Llama 3.1 and Qwen 2.5 demonstrate that DAPRO consistently achieves coverage closer to the nominal level with lower variance than static baselines, while satisfying the budget constraint.
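Abstract (reference translation): Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is important but very costly; key events, such as jailbreaks or successful task completion by an agent, often occur only after repeated interactions. These events may be rare and remain unobserved under any feasible computational budget. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations needed to trigger the event of interest, but they rely on static budget allocation, which is inefficient in multi-turn setups. To address this, we introduce Dynamic Allocation via PRojected Optimization (DAPRO), a theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn LLM interactions. We prove that DAPRO satisfies the budget constraint and provides distribution-free, finite-sample coverage guarantees without requiring the conditional independence between censoring and event times assumed by prior conformal survival approaches. A key theoretical contribution is a novel coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight. Furthermore, DAPRO can be used to obtain unbiased, low-variance estimates of population-level evaluation metrics, such as the jailbreak rate, under limited computing resources. Comprehensive experiments on agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations, using LLMs such as Llama 3.1 and Qwen 2.5, show that DAPRO consistently achieves coverage closer to the nominal level with lower variance than static baselines, while satisfying the budget constraint.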


Related papers