Fugu-MT paper translation (abstract): AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

Paper overview: AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models

arxiv url: http://arxiv.org/abs/2604.22871v1
Date: Thu, 23 Apr 2026 19:37:48 GMT
Status: translation complete
Last updated in system: 2026-04-28 17:12:07.015938
Title: AutoRISE: Agent-Driven Strategy Evolution for Red-Teaming Large Language Models
Authors: Tanmay Gautam, Alireza Bahramali, Sandeep Atluri,
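Abstract summary: We propose AutoRISE, a method that searches over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks. AutoRISE operates in a black-box, inference-only setting, requiring no fine-tuning, human annotation, or GPU compute.
Reference score (independently computed attention score): 2.75206475271089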
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated red-teaming methods for large language models typically optimize attack prompts within a fixed, human-designed strategy, leaving the attack strategy itself unchanged. We instead optimize the strategy. We propose AutoRISE, a method that searches over executable attack programs rather than individual prompts. At each iteration, a coding agent edits a strategy and a fixed evaluation harness scores the resulting attacks, returning both a scalar objective and per-example diagnostics that guide subsequent edits. This allows structural changes, including new attack components and altered control flow, that prompt-level methods do not directly express. We also release two benchmark suites developed on disjoint target sets and evaluate on 11 models from five families against seven established jailbreak datasets. Across held-out models, AutoRISE improves average attack success rate by 17.0 points over the strongest baseline, and improves attack success by up to 16 points on frontier targets with low baseline success rates. Ablations against parametric and strategy-library baselines suggest that these gains arise from unrestricted program search, particularly compositional techniques and control-flow edits. AutoRISE operates in a black-box, inference-only setting, requiring no fine-tuning, human annotation, or GPU compute.
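Illustrative sketch (not from the paper): the abstract describes a loop in which an editor proposes a change to an executable strategy and a fixed harness returns a scalar objective plus per-example diagnostics. The Python below shows only that loop shape on a harmless toy task (string transformations). Everything in it is an assumption made for illustration: the names `OPS`, `CASES`, `Report`, `harness`, `propose_edit`, and `search`, the random mutation that stands in for the coding agent, and the rule of accepting an edit when the score does not drop. The abstract does not specify any of these details.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical toy program space: a strategy is an ordered list of operation names.
OPS: Dict[str, Callable[[str], str]] = {
    "strip": str.strip,
    "upper": str.upper,
    "reverse": lambda s: s[::-1],
}

# Hypothetical fixed evaluation set: (input, expected output) pairs.
CASES: List[Tuple[str, str]] = [(" abc ", "CBA"), ("xy", "YX"), (" q", "Q")]


@dataclass
class Report:
    score: float         # scalar objective: fraction of cases passed
    failures: List[int]  # per-example diagnostics: indices of failed cases


def run(strategy: List[str], text: str) -> str:
    """Execute the strategy by applying its operations in order."""
    for name in strategy:
        text = OPS[name](text)
    return text


def harness(strategy: List[str]) -> Report:
    """Fixed evaluator: never changes during the search."""
    failures = [i for i, (x, want) in enumerate(CASES) if run(strategy, x) != want]
    return Report(1 - len(failures) / len(CASES), failures)


def propose_edit(strategy: List[str], report: Report, rng: random.Random) -> List[str]:
    """Stand-in for the coding agent: one random structural edit.

    Assumption: a real agent would read the diagnostics in detail; this toy
    version only uses them to stop editing once nothing fails.
    """
    if not report.failures:
        return list(strategy)
    edited = list(strategy)
    action = rng.choice(["insert", "delete", "replace"])
    if action == "insert" or not edited:
        edited.insert(rng.randint(0, len(edited)), rng.choice(list(OPS)))
    elif action == "delete":
        del edited[rng.randrange(len(edited))]
    else:
        edited[rng.randrange(len(edited))] = rng.choice(list(OPS))
    return edited


def search(iterations: int = 200, seed: int = 0) -> Tuple[List[str], Report]:
    """Propose-and-score loop; keeps an edit when the score does not drop."""
    rng = random.Random(seed)
    best: List[str] = []
    best_report = harness(best)
    for _ in range(iterations):
        candidate = propose_edit(best, best_report, rng)
        report = harness(candidate)
        if report.score >= best_report.score:  # assumed acceptance rule
            best, best_report = candidate, report
    return best, best_report
```

Calling `search()` returns the final strategy together with its `Report`. Because the harness is fixed and edits are kept only when the score does not drop, the returned score is never lower than that of the starting (empty) strategy.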
