Fugu-MT Paper Translation (Abstract): TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Paper overview: TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

arxiv url: http://arxiv.org/abs/2604.27861v1
Date: Thu, 30 Apr 2026 13:44:01 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.116116
Title: TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
Authors: Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang, Chaowei Xiao,
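Abstract summary: Decompositional jailbreaks pose a critical threat to large language models. We present TwinGate, a stateful dual-encoder defense framework. We construct a comprehensive dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents.
Reference score (independently computed attention metric): 60.68349524623048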
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that collectively reconstruct prohibited content. In real-world deployments, LLMs face a continuous, untraceable stream of fully anonymized and arbitrarily interleaved requests, infiltrated by covertly distributed adversarial queries. Under this rigorous threat model, state-of-the-art defensive strategies exhibit fundamental limitations. In the absence of trustworthy user metadata, they are incapable of tracking global historical contexts, while their deployment of generative models for real-time monitoring introduces computationally prohibitive overhead. To address this, we present TwinGate, a stateful dual-encoder defense framework. TwinGate employs Asymmetric Contrastive Learning (ACL) to cluster semantically disparate but intent-matched malicious fragments in a shared latent space, while a parallel frozen encoder suppresses false positives arising from benign topical overlap. Each request requires only a single lightweight forward pass, enabling the defense to execute in parallel with the target model's prefill phase at negligible latency overhead. To evaluate our approach and advance future research, we construct a comprehensive dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents. Evaluated on this large-scale corpus under a strictly causal protocol, TwinGate achieves high malicious intent recall at a remarkably low false positive rate while remaining highly robust against adaptive attacks. Furthermore, our proposal substantially outperforms stateful and stateless baselines, delivering superior throughput and reduced latency.
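The abstract describes a stateful gate that compares each incoming request against earlier traffic in two embedding spaces, but it does not say how the two encoders' outputs are combined or how the state is stored. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name `StatefulGate`, the thresholds, the sliding window, the `min_matches` count, and the decision rule (close in the intent space but not in the frozen topic space) are all hypothetical, and the two encoders are passed in as plain callables rather than real models.

```python
import math
from collections import deque
from typing import Callable, Deque, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[str], Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StatefulGate:
    """Hypothetical stateful gate over an anonymous request stream.

    intent_encoder stands in for the contrastively trained encoder,
    topic_encoder for the frozen encoder. Both are assumptions.
    """

    def __init__(
        self,
        intent_encoder: Encoder,
        topic_encoder: Encoder,
        intent_threshold: float = 0.8,  # assumed value
        topic_threshold: float = 0.8,   # assumed value
        min_matches: int = 2,           # assumed value
        window: int = 1000,             # assumed history size
    ) -> None:
        self.intent_encoder = intent_encoder
        self.topic_encoder = topic_encoder
        self.intent_threshold = intent_threshold
        self.topic_threshold = topic_threshold
        self.min_matches = min_matches
        # State: embeddings of past requests only, with no user metadata.
        self.history: Deque[Tuple[Vector, Vector]] = deque(maxlen=window)

    def check(self, request: str) -> bool:
        """Return True if the request should be flagged.

        Only earlier requests are consulted, so the decision is causal.
        """
        z_intent = self.intent_encoder(request)
        z_topic = self.topic_encoder(request)
        matches = 0
        for past_intent, past_topic in self.history:
            close_in_intent = cosine(z_intent, past_intent) >= self.intent_threshold
            close_in_topic = cosine(z_topic, past_topic) >= self.topic_threshold
            # Assumed rule: count fragments that share intent without
            # merely sharing a topic (topical overlap is suppressed).
            if close_in_intent and not close_in_topic:
                matches += 1
        self.history.append((z_intent, z_topic))
        return matches >= self.min_matches
```

In this sketch each request costs one call to each encoder plus a scan of the window. A real deployment would presumably replace the linear scan with an approximate nearest-neighbor index, which is again an assumption rather than something the abstract states.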


Related papers