Fugu-MT Paper Translation (Abstract): Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Paper overview: Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

arxiv url: http://arxiv.org/abs/2603.16192v1
Date: Tue, 17 Mar 2026 07:20:48 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.143851
Title: Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models
Authors: Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen
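Abstract summary: We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework. S2C strategically distributes and reshapes semantic cues so that consolidating the full intent requires multi-step inference. We evaluate S2C on multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors.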
Reference score (independently computed attention metric): 28.741175254258422
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning. These mechanisms enable the models to recover obfuscated malicious intent during inference and refuse accordingly, rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations of mechanisms perform best against broad families of models, and characterise how the trade-off between the extent of obfuscation and input recoverability affects jailbreak success.
