Fugu-MT paper translation (abstract): Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents

Paper overview: Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents

arxiv url: http://arxiv.org/abs/2604.18860v1
Date: Mon, 20 Apr 2026 21:36:16 GMT
Status: translation complete
Last updated in system: 2026-04-22 22:41:49.507465
Title: Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents
Authors: Wenpeng Xu
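Abstract summary: GUI agents that control desktop computers via screenshot-and-click loops introduce a new class of vulnerability. The paper formalizes this as a Visual Atomicity Violation and characterizes three concrete attack primitives. It proposes a lightweight three-layer defense that re-verifies the UI state immediately before each action dispatch.
Reference score (independently computed attention score): 0.7360807642941714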
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: GUI agents that control desktop computers via screenshot-and-click loops introduce a new class of vulnerability: the observation-to-action gap (mean 6.51 s on real OSWorld workloads) creates a Time-Of-Check, Time-Of-Use (TOCTOU) window during which an unprivileged attacker can manipulate the UI state. We formalize this as a Visual Atomicity Violation and characterize three concrete attack primitives: (A) Notification Overlay Hijack, (B) Window Focus Manipulation, and (C) Web DOM Injection. Primitive B, the closest desktop analog to Android Action Rebinding, achieves 100% action-redirection success rate with zero visual evidence at the observation time. We propose Pre-execution UI State Verification (PUSV), a lightweight three-layer defense that re-verifies the UI state immediately before each action dispatch: masked pixel SSIM at the click target (L1), global screenshot diff (L2a), and X Window snapshot diff (L2b). PUSV achieves 100% Action Interception Rate across 180 adversarial trials (135 Primitive A + 45 Primitive B) with zero false positives and < 0.1 s overhead. Against Primitive C (zero-visual-footprint DOM injection), PUSV reveals a structural blind spot (~0% AIR), motivating future OS+DOM defense-in-depth architectures. No single PUSV layer alone achieves full coverage; different primitives require different detection signals, validating the layered design.
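The three PUSV checks named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: the function names, the grayscale list-of-rows frame format, the patch radius, the thresholds, and the dictionary used as a window snapshot. The abstract does not specify how the L1 mask is built, so this sketch simply compares a square patch around the click target with a single-window SSIM, which is a simplification of the usual sliding-window form.

```python
from statistics import fmean


def crop(frame, x, y, radius):
    """Flatten the square patch centred on column x, row y (clipped at edges)."""
    rows = frame[max(0, y - radius): y + radius + 1]
    return [v for row in rows for v in row[max(0, x - radius): x + radius + 1]]


def ssim(a, b, dynamic_range=255):
    """Single-window SSIM of two equally sized flat patches."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean((p - mu_a) ** 2 for p in a)
    var_b = fmean((q - mu_b) ** 2 for q in b)
    cov = fmean((p - mu_a) * (q - mu_b) for p, q in zip(a, b))
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a * mu_a + mu_b * mu_b + c1) * (var_a + var_b + c2)
    )


def changed_fraction(before, after, tol=8):
    """Fraction of pixels whose intensity moved by more than tol."""
    pairs = [(p, q) for rb, ra in zip(before, after) for p, q in zip(rb, ra)]
    return sum(abs(p - q) > tol for p, q in pairs) / len(pairs)


def layers_triggered(obs_frame, now_frame, obs_windows, now_windows, click_xy,
                     radius=2, ssim_min=0.95, global_max=0.01):
    """Compare the state seen at observation time with the state just before
    dispatch; return the labels of the layers that report a change.
    All thresholds are illustrative assumptions."""
    x, y = click_xy
    hits = []
    # L1: local similarity at the click target.
    if ssim(crop(obs_frame, x, y, radius), crop(now_frame, x, y, radius)) < ssim_min:
        hits.append("L1")
    # L2a: global screenshot difference.
    if changed_fraction(obs_frame, now_frame) > global_max:
        hits.append("L2a")
    # L2b: window snapshot difference (hypothetical snapshot structure).
    if obs_windows != now_windows:
        hits.append("L2b")
    return hits
```

In this sketch an agent would dispatch the click only when `layers_triggered(...)` returns an empty list. A pixel-identical frame with a changed focused window is reported by the L2b comparison alone, which is one way to read the abstract's remark that different primitives require different detection signals.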
