Fugu-MT Paper Translation (Abstract): Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

Paper abstract: Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

arxiv url: http://arxiv.org/abs/2510.26847v1
Date: Thu, 30 Oct 2025 12:42:45 GMT
Status: Translation complete
Last updated in system: 2025-11-03 17:52:15.866994
Title: Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token
Title (reference translation): Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token
Authors: Shaked Zychlinski, Yuval Kainan
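Abstract summary: Large language models (LLMs) are susceptible to jailbreak attacks in which malicious prompts are disguised using ciphers or character-level encodings. The paper introduces CPT-Filtering, a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy.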
Reference score (independently calculated attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are susceptible to jailbreak attacks where malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic with negligible-costs and near-perfect accuracy guardrail technique that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique uses a simple yet powerful artifact of using language models: the average number of Characters Per Token (CPT) in the text. This approach is motivated by the high compute cost of modern methods - relying on added modules such as dedicated LLMs or perplexity models. We validate our approach across a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. Our experiments demonstrate that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be immediately deployed for real-time text filtering and offline data curation.
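Abstract (reference translation): Large language models (LLMs) are susceptible to jailbreak attacks in which malicious prompts are disguised with ciphers or character-level encodings to bypass safety guardrails. These guardrails often fail to interpret the encoded content, yet the underlying model can still process the harmful instructions. CPT-Filtering is a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy that aims to mitigate these attacks by exploiting the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. The method rests on the principle that tokenizers trained on natural language represent out-of-distribution text, such as ciphers, with a significantly larger number of shorter tokens. It relies on a simple yet powerful artifact of using language models: the average number of Characters Per Token (CPT) in the text. The approach is motivated by the high compute cost of current methods, which depend on added modules such as dedicated LLMs or perplexity models. The authors validate it on a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. The experiments show that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be deployed immediately for real-time text filtering and offline data curation.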
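The CPT idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `TOY_VOCAB`, `toy_encode`, `chars_per_token`, `looks_obfuscated`, and the threshold value of 2.0 are all hypothetical names and numbers chosen for illustration. The toy greedy longest-match tokenizer only stands in for a trained BPE tokenizer so that the example runs without external dependencies, and it exaggerates the gap between plain and enciphered text.

```python
import codecs
from typing import Callable, List, Sequence

# Hypothetical toy vocabulary standing in for a BPE vocabulary learned
# from natural-language text (real vocabularies have tens of thousands
# of entries).
TOY_VOCAB = {"the", " the", " quick", " brown", " fox",
             " jumps", " over", " lazy", " dog"}
MAX_PIECE = max(len(piece) for piece in TOY_VOCAB)


def toy_encode(text: str) -> List[str]:
    """Greedy longest-match tokenizer; unknown text falls back to
    single characters. A stand-in for a real BPE tokenizer."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(MAX_PIECE, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in TOY_VOCAB:
                tokens.append(piece)
                i += size
                break
    return tokens


def chars_per_token(text: str,
                    encode: Callable[[str], Sequence]) -> float:
    """Average number of characters per token (CPT) for `text`."""
    tokens = encode(text)
    if not tokens:
        raise ValueError("text produced no tokens")
    return len(text) / len(tokens)


def looks_obfuscated(text: str,
                     encode: Callable[[str], Sequence],
                     threshold: float = 2.0) -> bool:
    """Flag text whose CPT falls below `threshold`.
    The default threshold is an assumption, not a value from the paper."""
    if not text:
        return False  # nothing to judge on empty input
    return chars_per_token(text, encode) < threshold


if __name__ == "__main__":
    plain = "the quick brown fox jumps over the lazy dog"
    cipher = codecs.encode(plain, "rot13")  # simple substitution cipher
    for sample in (plain, cipher):
        print(repr(sample),
              round(chars_per_token(sample, toy_encode), 2),
              looks_obfuscated(sample, toy_encode))
```

Because `chars_per_token` accepts any `encode` callable, a production tokenizer's encode function could be passed in place of `toy_encode`; a suitable threshold would then have to be calibrated on real data for that tokenizer.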
