Fugu-MT paper translation (abstract): Adaptive Instruction Composition for Automated LLM Red-Teaming

Paper summary: Adaptive Instruction Composition for Automated LLM Red-Teaming

arxiv url: http://arxiv.org/abs/2604.21159v1
Date: Wed, 22 Apr 2026 23:55:32 GMT
Status: translation complete
Last updated in system: 2026-04-24 14:40:06.218383
Title: Adaptive Instruction Composition for Automated LLM Red-Teaming
Authors: Jesse Zymet, Andy Luo, Swapnil Shinde, Sahil Wadhwa, Emily Chen,
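Abstract summary: This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. The approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. The authors employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.
Reference score (attention score computed independently by this site): 0.8369173719399807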
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many approaches to LLM red-teaming leverage an attacker LLM to discover jailbreaks against a target. Several of them task the attacker with identifying effective strategies through trial and error, resulting in a semantically limited range of successes. Another approach discovers diverse attacks by combining crowdsourced harmful queries and tactics into instructions for the attacker, but does so at random, limiting effectiveness. This article introduces a novel framework, Adaptive Instruction Composition, that combines crowdsourced texts according to an adaptive mechanism trained to jointly optimize effectiveness with diversity. We use reinforcement learning to balance exploration with exploitation in a combinatorial space of instructions to guide the attacker toward diverse generations tailored to target vulnerabilities. We demonstrate that our approach substantially outperforms random combination on a set of effectiveness and diversity metrics, even under model transfer. Further, we show that it surpasses a host of recent adaptive approaches on Harmbench. We employ a lightweight neural contextual bandit that adapts to contrastive embedding inputs, and provide ablations suggesting that the contrastive pretraining enables the network to rapidly generalize and scale to the massive space as it learns.
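The abstract describes a contextual bandit that chooses which crowdsourced texts to combine, balancing exploration with exploitation. The sketch below illustrates that general idea only and is not the authors' method. It swaps in an epsilon-greedy policy with a linear reward model where the paper uses a neural contextual bandit, and it omits the diversity objective. All names (`EpsilonGreedyLinearBandit`, `compose_arms`, `simulate`), the toy embeddings, and the simulated reward are hypothetical placeholders.

```python
import random
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class EpsilonGreedyLinearBandit:
    """Linear reward model trained by SGD; arms are picked epsilon-greedily.

    Assumption: a simplified stand-in for the neural contextual bandit
    mentioned in the abstract, not a reproduction of it.
    """

    def __init__(self, dim, epsilon=0.1, lr=0.05, seed=0):
        self.w = [0.0] * dim
        self.epsilon = epsilon
        self.lr = lr
        self.rng = random.Random(seed)

    def predict(self, x):
        return dot(self.w, x)

    def select(self, arms):
        # Explore: with probability epsilon, pick a uniformly random arm.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(arms))
        # Exploit: pick the arm with the highest predicted reward.
        scores = [self.predict(x) for x in arms]
        return max(range(len(arms)), key=scores.__getitem__)

    def update(self, x, reward):
        # One SGD step on the squared prediction error.
        err = reward - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]


def compose_arms(query_embs, tactic_embs):
    """Each (query, tactic) pair is one arm; its context is the concatenated embeddings."""
    pairs = list(product(range(len(query_embs)), range(len(tactic_embs))))
    feats = [query_embs[q] + tactic_embs[t] for q, t in pairs]
    return pairs, feats


def simulate(rounds=500, seed=0):
    rng = random.Random(seed)
    # Toy random vectors standing in for contrastive embeddings (hypothetical data).
    query_embs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
    tactic_embs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
    pairs, feats = compose_arms(query_embs, tactic_embs)
    hidden_w = [rng.uniform(-1, 1) for _ in range(6)]  # simulated target behaviour
    bandit = EpsilonGreedyLinearBandit(dim=6, seed=seed)
    chosen = []
    for _ in range(rounds):
        i = bandit.select(feats)
        # Noisy simulated score standing in for an external success judgment.
        reward = dot(hidden_w, feats[i]) + rng.gauss(0, 0.1)
        bandit.update(feats[i], reward)
        chosen.append(pairs[i])
    return chosen
```

Calling `simulate()` returns the sequence of (query index, tactic index) pairs the policy selected. Random combination, the baseline the abstract compares against, would correspond to setting `epsilon=1.0`.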


Related papers