Fugu-MT Paper Translation (Abstract): One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

Paper overview: One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

arxiv url: http://arxiv.org/abs/2605.05630v1
Date: Thu, 07 May 2026 03:35:31 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.50421
Title: One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
Title (reference translation): Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
Authors: Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang, Ruihan Wu, Eli Chien, Bo Li, Pin-Yu Chen, Pan Li
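Abstract summary: Hidden malicious intent in multi-turn dialogue poses a growing threat to large language models (LLMs). Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks, despite advances in safety alignment and external guardrails. This work addresses the challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action.
Reference score (independently computed attention score): 55.98008208209856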
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Rather than exposing a harmful objective in a single prompt, increasingly capable attackers can distribute their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks despite advances in safety alignment and external guardrails. In this work, we address this challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. This objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To further support training and evaluation, we construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. We show that MTID helps enable a turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. Our code is available at https://github.com/Graph-COM/TurnGate.
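Abstract (reference translation): Hidden malicious intent in multi-turn dialogue poses a growing threat to deployed large language models (LLMs). Instead of exposing a harmful objective in a single prompt, increasingly capable attackers can spread their intent across multiple benign-looking turns. Recent studies show that even modern commercial models with advanced guardrails remain vulnerable to such attacks, despite advances in safety alignment and external guardrails. This work addresses the challenge by detecting the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harmful action. That objective requires precise turn-level intervention that identifies the harm-enabling closure point while avoiding premature refusal of benign exploratory conversations. To support training and evaluation, the authors construct the Multi-Turn Intent Dataset (MTID), which contains branching attack rollouts, matched benign hard negatives, and annotations of the earliest harm-enabling turns. They show that MTID helps enable the turn-level monitor TurnGate, which substantially outperforms existing baselines in harmful-intent detection while maintaining low over-refusal rates. TurnGate further generalizes across domains, attacker pipelines, and target models. The code is available at https://github.com/Graph-COM/TurnGate.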
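
The detection objective in the abstract (find the earliest turn at which delivering the candidate response would make the accumulated interaction sufficient to enable harm) can be illustrated with a minimal sketch. This is not TurnGate's implementation: the function `earliest_harm_enabling_turn`, the `score_fn` interface, the threshold, and the toy scorer are all hypothetical names introduced here for illustration. The only idea taken from the abstract is that the check is response-aware, meaning it scores the candidate response together with what has already been delivered, before that response is sent.

```python
from typing import Callable, List, Optional, Tuple

# One turn = (user message, candidate response that has not been delivered yet).
Turn = Tuple[str, str]


def earliest_harm_enabling_turn(
    turns: List[Turn],
    score_fn: Callable[[List[Turn], Turn], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """Return the 0-based index of the first turn whose candidate response
    would push the accumulated dialogue to or past `threshold`, else None.

    `score_fn` is a hypothetical stand-in for a learned turn-level monitor.
    """
    delivered: List[Turn] = []
    for i, turn in enumerate(turns):
        # Score the already-delivered history plus the pending response.
        if score_fn(delivered, turn) >= threshold:
            return i  # intervene here: withhold this response
        delivered.append(turn)  # response judged safe to deliver
    return None


def toy_score(history: List[Turn], candidate: Turn) -> float:
    """Toy scorer (assumption): fraction of three placeholder 'pieces'
    covered by all responses delivered so far plus the candidate."""
    pieces = ("piece_a", "piece_b", "piece_c")
    text = " ".join(response for _, response in history + [candidate])
    return sum(p in text for p in pieces) / len(pieces)


dialogue = [
    ("q1", "piece_a"),
    ("q2", "nothing relevant"),
    ("q3", "piece_b"),
    ("q4", "piece_c"),
]
flagged = earliest_harm_enabling_turn(dialogue, toy_score, threshold=1.0)
```

With the toy scorer, no single response crosses the threshold on its own; only the accumulated responses do, which is the multi-turn setting the abstract describes. Lowering the threshold makes the gate fire earlier, which in a real system would trade missed attacks against over-refusal of benign conversations.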

Related papers