Fugu-MT paper translation (abstract): RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models

Paper overview: RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models

arxiv url: http://arxiv.org/abs/2601.15331v1
Date: Tue, 20 Jan 2026 06:01:02 GMT
Status: Translation complete
Last updated in system: 2026-01-23 21:37:20.355054
Title: RECAP: A Resource-Efficient Method for Adversarial Prompting in Large Language Models
Authors: Rishit Chugh
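Abstract summary: This paper proposes a resource-efficient adversarial prompting method that eliminates the need for retraining by matching new prompts against a database of pre-trained adversarial prompts. By retrieving semantically similar adversarial prompts, it achieves competitive attack success rates at significantly reduced computational cost.
Reference score (independently computed attention score): 0.0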
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The deployment of large language models (LLMs) has raised security concerns due to their susceptibility to producing harmful or policy-violating outputs when exposed to adversarial prompts. While alignment and guardrails mitigate common misuse, they remain vulnerable to automated jailbreaking methods such as GCG, PEZ, and GBDA, which generate adversarial suffixes via training and gradient-based search. Although effective, these methods, particularly GCG, are computationally expensive, limiting their practicality for organisations with constrained resources. This paper introduces a resource-efficient adversarial prompting approach that eliminates the need for retraining by matching new prompts to a database of pre-trained adversarial prompts. A dataset of 1,000 prompts was classified into seven harm-related categories, and GCG, PEZ, and GBDA were evaluated on a Llama 3 8B model to identify the most effective attack method per category. Results reveal a correlation between prompt type and algorithm effectiveness. By retrieving semantically similar successful adversarial prompts, the proposed method achieves competitive attack success rates with significantly reduced computational cost. This work provides a practical framework for scalable red-teaming and security evaluation of aligned LLMs, including in settings where model internals are inaccessible.
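The retrieval step mentioned in the abstract (matching a new prompt to the most semantically similar stored entry) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the toy bag-of-words `embed` stands in for whatever sentence encoder the paper uses, and the `DATABASE` rows, category labels, and suffix identifiers are hypothetical placeholders, not data from the paper.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical rows: (reference prompt, harm category, best method, stored suffix id).
# All values are placeholders, not entries from the paper's dataset.
DATABASE = [
    ("placeholder prompt about topic alpha", "category_a", "GCG", "suffix_001"),
    ("placeholder prompt about topic beta", "category_b", "PEZ", "suffix_002"),
    ("another placeholder about topic gamma", "category_c", "GBDA", "suffix_003"),
]


def retrieve(query, database=DATABASE):
    """Return the stored row whose reference prompt is most similar to the query."""
    q = embed(query)
    return max(database, key=lambda row: cosine(q, embed(row[0])))
```

Calling `retrieve("new prompt about topic beta")` returns the second row, since its reference prompt shares the most tokens with the query. Because lookup replaces a fresh gradient-based search, the per-query cost is a single similarity scan over the database, which is the source of the computational saving the abstract describes.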
