Fugu-MT 論文翻訳(概要): MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

論文の概要: MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

arxiv url: http://arxiv.org/abs/2510.10271v1
Date: Sat, 11 Oct 2025 16:14:56 GMT
ステータス: 翻訳完了
システム内更新日: 2025-10-14 18:06:29.868823
Title: MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation
Title（参考訳）: MetaBreak: 特別なトークン操作によるオンラインLLMサービスの脱獄
Authors: Wentian Zhu, Zhen Xiang, Wei Niu, Le Guan,
Abstract要約: 大規模言語モデルの微調整プロセス中に、構造化された会話に注釈を付けるために特別なトークンが作成されます。攻撃プリミティブを4つ構築するために特別なトークンを利用することができることを示す。本手法は,コンテンツモデレーションが展開されない場合,SOTAプロンプトエンジニアリングソリューションに匹敵するジェイルブレイク率を実現する。
参考スコア（独自算出の注目度）: 16.48157553847625
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Unlike regular tokens derived from existing text corpora, special tokens are artificially created to annotate structured conversations during the fine-tuning process of Large Language Models (LLMs). Serving as metadata of training data, these tokens play a crucial role in instructing LLMs to generate coherent and context-aware responses. We demonstrate that special tokens can be exploited to construct four attack primitives, with which malicious users can reliably bypass the internal safety alignment of online LLM services and circumvent state-of-the-art (SOTA) external content moderation systems simultaneously. Moreover, we found that addressing this threat is challenging, as aggressive defense mechanisms-such as input sanitization by removing special tokens entirely, as suggested in academia-are less effective than anticipated. This is because such defense can be evaded when the special tokens are replaced by regular ones with high semantic similarity within the tokenizer's embedding space. We systemically evaluated our method, named MetaBreak, on both lab environment and commercial LLM platforms. Our approach achieves jailbreak rates comparable to SOTA prompt-engineering-based solutions when no content moderation is deployed. However, when there is content moderation, MetaBreak outperforms SOTA solutions PAP and GPTFuzzer by 11.6% and 34.8%, respectively. Finally, since MetaBreak employs a fundamentally different strategy from prompt engineering, the two approaches can work synergistically. Notably, empowering MetaBreak on PAP and GPTFuzzer boosts jailbreak rates by 24.3% and 20.2%, respectively.
Abstract（参考訳）: 既存のテキストコーパスから派生した通常のトークンとは異なり、LLM(Large Language Models)の微調整過程において、構造化された会話に注釈を付けるために特別なトークンが人工的に作成される。トレーニングデータのメタデータとして機能するこれらのトークンは、コヒーレントでコンテキスト対応の応答を生成するためにLLMに指示する上で重要な役割を果たす。攻撃プリミティブを4つ構築するために特別なトークンを利用でき、悪意のあるユーザはオンラインLLMサービスの内部安全アライメントを確実に回避し、SOTA(State-of-the-art)外部コンテンツモデレーションシステムを同時に回避できる。さらに,この脅威に対処する上で,特別なトークンを完全に取り除くことで,入力衛生化などの攻撃的な防御機構が期待するよりも効果が低いことが判明した。これは、特別なトークンがトークンの埋め込み空間内で高い意味的類似性を持つ通常のトークンに置き換えられたとき、そのような防御を回避することができるためである。実験室環境と商用LLMプラットフォームの両方で,MetaBreakという手法を体系的に評価した。本手法は,コンテンツモデレーションが展開されない場合,SOTAプロンプトエンジニアリングベースのソリューションに匹敵するジェイルブレイク率を実現する。しかし、コンテンツモデレーションがある場合、MetaBreakはSOTAソリューションのPAPとGPTFuzzerをそれぞれ11.6%、GPTFuzzerは34.8%上回っている。最後に、MetaBreakはプロンプトエンジニアリングと根本的に異なる戦略を採用しているため、2つのアプローチは相乗的に機能する。特に、PAPとGPTFuzzerでMetaBreakに権限を与えると、それぞれ24.3%、20.2%のジェイルブレイク率が向上する。

論文の概要: MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

関連論文リスト