Fugu-MT Paper Translation (Abstract): Active Attacks: Red-teaming LLMs via Adaptive Environments

Paper summary: Active Attacks: Red-teaming LLMs via Adaptive Environments

arxiv url: http://arxiv.org/abs/2509.21947v1
Date: Fri, 26 Sep 2025 06:27:00 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.239334
Title: Active Attacks: Red-teaming LLMs via Adaptive Environments
Title (reference translation): Active Attacks: Red-teaming LLMs via Adaptive Environments
Authors: Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
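Abstract summary: We address the challenge of generating diverse attack prompts for large language models (LLMs). We introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves.
Reference score (independently computed attention score): 71.55110023234376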
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods (including GFlowNets, PPO, and REINFORCE) by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than 400×) with only a 6% increase in computation. Our code is publicly available at https://github.com/dbsxodud-11/active_attacks.
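Abstract (reference translation): We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs are trained with reinforcement learning (RL) to generate such prompts automatically, using only a toxicity classifier as the reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to a limited set of modes. Inspired by the active learning paradigm, which encourages adaptive exploration, we introduce Active Attacks, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. Periodically safety fine-tuning the victim LLM on the collected attack prompts diminishes rewards in already-exploited regions, forcing the attacker to seek undiscovered vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum in which the attacker moves beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination broadly covers the multi-mode distribution. It outperforms prior RL-based methods, including GFlowNets, PPO, and REINFORCE, improving the cross-attack success rate against GFlowNets, the previous state of the art, from 0.07% to 31.28% (a relative gain of more than 400×). The code is publicly available at https://github.com/dbsxodud-11/active_attacks.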
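The alternation described in the abstract (attack, collect prompts, safety fine-tune the victim, attack again) can be illustrated with a toy sketch. This is not the authors' implementation: the function `red_team`, the list `BASE`, and the reward values are hypothetical, an argmax stands in for a trained RL attacker, and zeroing a reward stands in for safety fine-tuning the victim on the collected prompts. It only shows why diminishing rewards in exploited regions pushes the attacker from easy modes toward harder ones.

```python
def red_team(base_rewards, num_rounds, active):
    """Toy loop: return attack modes in the order the attacker discovers them."""
    # Current reward of each attack mode against the victim (hypothetical values).
    rewards = list(base_rewards)
    discovered = []
    for _ in range(num_rounds):
        # Assumption: a converged attacker exploits the highest-reward mode.
        mode = max(range(len(rewards)), key=lambda m: rewards[m])
        if rewards[mode] <= 0.0:
            break  # no remaining mode elicits harmful output
        if mode not in discovered:
            discovered.append(mode)
        if active:
            # Assumption: fine-tuning the victim on the collected prompts
            # removes the reward in the exploited region entirely.
            rewards[mode] = 0.0
    return discovered


BASE = [0.9, 0.1, 0.6, 0.3]  # hypothetical per-mode rewards, easy (high) to hard (low)

static_modes = red_team(BASE, num_rounds=4, active=False)
active_modes = red_team(BASE, num_rounds=4, active=True)

print(static_modes)  # [0]          -> collapses to the easiest mode
print(active_modes)  # [0, 2, 3, 1] -> easy-to-hard curriculum over all modes
```

With a fixed victim the toy attacker keeps returning to mode 0, mirroring the mode collapse the abstract attributes to existing methods. With the adaptive victim it visits modes in order of decreasing reward, which is the easy-to-hard behaviour the abstract describes.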

Related papers