Fugu-MT Paper Translation (Abstract): STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

Paper summary: STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

arxiv url: http://arxiv.org/abs/2605.00699v3
Date: Thu, 07 May 2026 06:02:19 GMT
Status: Translation complete
Last updated in system: 2026-05-08 17:36:05.865644
Title: STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
Title (reference translation): STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack
Authors: Xutao Mao, Liangjie Zhao, Tao Liu, Xiang Zheng, Hongying Zan, Cong Wang
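Abstract summary: Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. STARE is a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface. STARE attains a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines.
Reference score (independently calculated attention score): 10.532127028931058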
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, returning only terminal toxicity scores and leaving open the question of when and how toxic semantics emerge during multi-step synthesis. We introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface, under a direct white-box T2I and query-only black-box VLM setting. By coupling a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE attains a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines. More importantly, this trajectory-level view surfaces the Optimization-Induced Phase Alignment phenomenon: vanilla models exhibit diffuse toxicity, whereas adversarial optimization concentrates conceptual harms into early semantic phases and detail-oriented harms into late refinement. Targeted perturbations of either window selectively suppress different toxicity categories, indicating that this temporal structure is a genuine causal handle rather than a side effect of the hierarchical design. The phenomenon turns toxicity formation from a chaotic process into a small set of predictable vulnerability windows, providing both a potent attack engine and a basis for phase-aware safety mechanisms. Content warning: This paper contains examples of toxic content that may be offensive or disturbing.
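Abstract (reference translation): Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box and return only terminal toxicity scores, leaving open the question of when and how toxic semantics emerge during multi-step synthesis. STARE is a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface, under a direct white-box T2I and query-only black-box VLM setting. By coupling a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE improves Attack Success Rate by 68% over state-of-the-art black-box and white-box baselines. The trajectory-level view also surfaces the Optimization-Induced Phase Alignment phenomenon: vanilla models exhibit diffuse toxicity, whereas adversarial optimization concentrates conceptual harms into early semantic phases and detail-oriented harms into late refinement. Targeted perturbations of either window selectively suppress different toxicity categories, indicating that this temporal structure is a genuine causal handle rather than a side effect of the hierarchical design. The phenomenon turns toxicity formation from a chaotic process into a small set of predictable vulnerability windows, providing both a potent attack engine and a basis for phase-aware safety mechanisms. Content warning: This paper contains examples of toxic content that may be offensive or disturbing.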

Related papers