Fugu-MT Paper Translation (Abstract): LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

Paper abstract: LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

arxiv url: http://arxiv.org/abs/2605.21362v1
Date: Wed, 20 May 2026 16:27:00 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.779105
Title: LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
Authors: Abdullah Al Nomaan Nafi, Fnu Suya, Swarup Bhunia, Prabuddha Chakraborty
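Abstract summary: Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models; it achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation.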
Reference score (independently computed attention score): 8.091700349640835
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Jailbreak attacks expose a persistent gap between the intended safety behavior of aligned large language models and their behavior under adversarial prompting. Existing automated methods are increasingly effective but each commits to a single attack family (e.g., one refinement loop, one tree search, one mutation space, or one strategy library) and no single family dominates: the best-performing method shifts across target models and harm categories, suggesting complementary strengths that per-prompt composition could exploit. We introduce LASH (LLM Adaptive Semantic Hybridization), a black-box framework that treats outputs from multiple base attacks as reusable seed prompts and adaptively composes them for each target request. Given a seed pool, LASH searches over seed subsets and softmax-normalized mixture weights; a composition module synthesizes a single candidate prompt, and a derivative-free genetic optimizer updates the weights using black-box target feedback and a two-stage fitness function combining keyword-based refusal detection with LLM-judge scoring. On JailbreakBench, which contains 100 harmful prompts across 10 categories, we evaluate LASH on six common target models. LASH achieves an average attack success rate of 84.5% under keyword-based evaluation and 74.5% under two-stage evaluation, where responses are first filtered for refusals and then scored by an LLM judge for whether they substantively fulfill the original harmful request. LASH outperforms five state-of-the-art baselines on both metrics with only 30 mean target queries. LASH also remains competitive under three defense mechanisms and induces more success-like internal representations. These results suggest that adaptive composition across heterogeneous jailbreak strategies is a promising direction for black-box red-teaming.
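The abstract says the mixture weights are "softmax-normalized" but gives no further detail. As a minimal sketch of that normalization step only, assuming one raw real-valued score per seed (the function name `softmax_weights` and the example scores are hypothetical and do not come from the paper):

```python
import math


def softmax_weights(scores):
    """Map raw real-valued scores to non-negative weights that sum to 1.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    # Subtracting the maximum keeps exp() from overflowing on large scores
    # and does not change the result.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Example: three illustrative scores; a higher score gets a larger weight.
weights = softmax_weights([0.0, 1.0, 2.0])
```

Because the weights always sum to 1 whatever the raw scores are, an optimizer can adjust the scores freely without a separate renormalization step; whether the paper parameterizes the search this way is not stated in the abstract.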
