Fugu-MT paper translation (abstract): Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

Paper overview: Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance

arxiv url: http://arxiv.org/abs/2606.15531v2
Date: Tue, 16 Jun 2026 14:29:31 GMT
Status: translation complete
Last updated in system: 2026-06-17 15:01:46.734344
Title: Greedy Coordinate Diffusion: Effective and Semantically Coherent Adversarial Attacks via Diffusion Guidance
Title (reference translation): Greedy Coordinate Diffusion: Effective Semantically Coherent Attacks via Diffusion Guidance
Authors: Bohdan Turbal, Blossom Metevier, Max Springer, Aleksandra Korolova
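Abstract summary: Adversarial attacks on large language models have limited practical impact despite extensive research. This paper introduces Greedy Coordinate Diffusion (GCD). GCD maintains low perplexity and high semantic adherence to the adversary's original intent.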
Reference score (independently computed attention measure): 48.34904668359272
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Adversarial attacks on large language models have limited practical impact despite extensive research. Optimization-based attacks such as Greedy Coordinate Gradient (GCG) (Zou et al., 2023) produce high-perplexity, incoherent suffixes that existing defenses easily detect (Bengio et al., 2024). Moreover, attempting to enforce coherence constraints during optimization often prevents the attack from successfully eliciting the specific targeted response, resulting in low success rates against robust models. Conversely, attacks that maintain coherence often alter the semantic intent of queries; when the model complies with these altered queries, responses fail to address the adversary's original goal. In this work, we introduce Greedy Coordinate Diffusion (GCD), a novel framework that efficiently generates adversarial attacks against safety-aligned models while maintaining low perplexity and high semantic adherence to the adversary's original intent. GCD leverages the generative priors of discrete diffusion language models to guide the search for adversarial suffixes that achieve semantic coherence and adherence. Unlike GCG, GCD does not require direct gradient access, allowing it to operate in a gray-box setting. We show that GCD achieves the highest attack success rate (ASR) while remaining competitive on response-quality scores, and that the constructed adversarial prompts are detected at lower rates than those of other methods by perplexity-based and guard-model filters.
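Abstract (reference translation): Adversarial attacks on large language models have limited practical impact despite extensive research. Optimization-based attacks such as Greedy Coordinate Gradient (GCG) (Zou et al., 2023) produce high-perplexity, incoherent suffixes that existing defenses can easily detect (Bengio et al., 2024). Moreover, attempting to enforce coherence constraints during optimization prevents the attack from successfully eliciting the specific targeted response, lowering success rates against robust models. Conversely, attacks that maintain coherence often change the semantic intent of queries, and when the model complies with these changed queries, the responses cannot address the adversary's original goal. This paper introduces Greedy Coordinate Diffusion (GCD), a new framework that efficiently generates adversarial attacks against safety-aligned models while maintaining low perplexity and high semantic adherence to the adversary's original intent. GCD uses the generative priors of discrete diffusion language models to guide the search for adversarial suffixes that achieve semantic coherence and adherence. Unlike GCG, GCD does not require direct gradient access, so it can operate in a gray-box setting. We show that GCD can achieve a high ASR while remaining competitive on response-quality scores, and that the constructed adversarial prompts are detected at lower rates than those of other methods by perplexity-based and guard-model filters.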
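
The perplexity-based filtering mentioned in the abstract can be illustrated with a minimal sketch. It assumes that per-token natural-log probabilities are already available from some scoring language model; the function names, the threshold of 100.0, and the sample log-probabilities are hypothetical and are not taken from the paper. Perplexity is computed here with the standard definition, the exponential of the negative mean log-probability per token.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flagged_by_perplexity_filter(
    token_logprobs: Sequence[float], threshold: float
) -> bool:
    """Return True when perplexity exceeds the threshold (a hypothetical cutoff)."""
    return perplexity(token_logprobs) > threshold


# Made-up log-probabilities, for illustration only.
fluent = [math.log(0.5)] * 8        # every token assigned probability 0.5
incoherent = [math.log(0.001)] * 8  # every token assigned probability 0.001

for name, logprobs in (("fluent", fluent), ("incoherent", incoherent)):
    flagged = flagged_by_perplexity_filter(logprobs, threshold=100.0)
    print(f"{name}: perplexity={perplexity(logprobs):.1f}, flagged={flagged}")
```

Under these assumptions, text whose tokens the scoring model finds unlikely gets a high perplexity and is flagged, which is why low-perplexity prompts are harder for such a filter to catch. In practice the threshold would have to be calibrated on benign prompts.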


Related papers