Fugu-MT Paper Translation (Abstract): Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information

Paper abstract: Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information

arxiv url: http://arxiv.org/abs/2311.11509v3
Date: Sun, 18 Feb 2024 06:04:27 GMT
Status: Translation complete
System update date: 2024-02-21 04:32:04.313987
Title: Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Authors: Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, and Viswanathan Swaminathan
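Abstract summary: Large language models are susceptible to adversarial prompt attacks. This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs. The paper proposes a novel approach to detecting adversarial prompts at the token level.
Reference score (independently computed attention metric): 67.78183175605761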
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: In recent years, Large Language Models (LLMs) have emerged as pivotal tools in various applications. However, these models are susceptible to adversarial prompt attacks, where attackers can carefully curate input strings that mislead LLMs into generating incorrect or undesired outputs. Previous work has revealed that with relatively simple yet effective attacks based on discrete optimization, it is possible to generate adversarial prompts that bypass moderation and alignment of the models. This vulnerability to adversarial prompts underscores a significant concern regarding the robustness and reliability of LLMs. Our work aims to address this concern by introducing a novel approach to detecting adversarial prompts at the token level, leveraging the LLM's capability to predict the next token's probability. We measure the degree of the model's perplexity, where tokens predicted with high probability are considered normal, and those exhibiting high perplexity are flagged as adversarial. Additionally, our method integrates context understanding by incorporating neighboring token information to encourage the detection of contiguous adversarial prompt sequences. To this end, we design two algorithms for adversarial prompt detection: one based on optimization techniques and another on Probabilistic Graphical Models (PGM). Both methods are equipped with efficient solvers, keeping adversarial prompt detection efficient. Our token-level detection result can be visualized as heatmap overlays on the text sequence, allowing for a clearer and more intuitive representation of which parts of the text may contain adversarial prompts.
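The idea described in the abstract, flagging high-perplexity tokens while encouraging contiguous flagged spans, can be illustrated with a minimal sketch. This is not the paper's algorithm: the cost function, the function name detect_adversarial_tokens, and the parameters threshold and switch_penalty are all assumptions made for illustration, and the per-token probabilities are assumed to come from some language model that is not shown here. Each token pays its surprisal (-log p) if labeled normal or a fixed threshold if labeled adversarial, plus a penalty every time the label changes between neighbors. The minimum-cost labeling is found by dynamic programming.

```python
import math


def detect_adversarial_tokens(token_probs, threshold=6.0, switch_penalty=2.0):
    """Label each token 0 (normal) or 1 (adversarial).

    Illustrative cost (an assumption, not taken from the paper):
      - label 0 costs the token's surprisal, -log(p)
      - label 1 costs a fixed `threshold`
      - each label change between neighbors costs `switch_penalty`
    The labeling with minimum total cost is found by a Viterbi-style
    dynamic program over the two labels.
    """
    n = len(token_probs)
    if n == 0:
        return []

    # unary[i][c] is the cost of giving token i the label c.
    unary = [(-math.log(p), threshold) for p in token_probs]

    cost = [unary[0][0], unary[0][1]]  # best cost of a prefix ending in label c
    back = []                          # back[i - 1][c]: best label of token i - 1
    for i in range(1, n):
        new_cost = [0.0, 0.0]
        pointers = [0, 0]
        for c in (0, 1):
            stay = cost[c]
            switch = cost[1 - c] + switch_penalty
            if stay <= switch:
                new_cost[c], pointers[c] = stay + unary[i][c], c
            else:
                new_cost[c], pointers[c] = switch + unary[i][c], 1 - c
        back.append(pointers)
        cost = new_cost

    # Trace the best path backwards from the cheaper final label.
    label = 0 if cost[0] <= cost[1] else 1
    labels = [label]
    for pointers in reversed(back):
        label = pointers[label]
        labels.append(label)
    labels.reverse()
    return labels


# Hypothetical next-token probabilities for an 8-token prompt.
probs = [0.5, 0.4, 0.6, 1e-5, 0.05, 1e-6, 1e-5, 0.5]
labels = detect_adversarial_tokens(probs)
independent = detect_adversarial_tokens(probs, switch_penalty=0.0)
```

With switch_penalty set to 0 the procedure reduces to thresholding each token's surprisal independently, so the fifth token (probability 0.05) is left unflagged in the middle of a suspicious run. With the default penalty, that token is absorbed into the surrounding flagged span, which is the contiguity effect the abstract attributes to neighboring-token information. The resulting 0/1 labels could then drive a heatmap overlay on the text.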


Related papers