Fugu-MT Paper Translation (Abstract): AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

Paper abstract: AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

arxiv url: http://arxiv.org/abs/2605.19260v1
Date: Tue, 19 May 2026 02:13:29 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.071542
Title: AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
Authors: Yuankai Li, Tinghui Zhu, Ha Min Son, Zhe Zhao, Xin Liu, Muhao Chen
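Abstract summary: This paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline, ensuring that all position-encoding stages remain consistent. The authors implement AQuaUI on state-of-the-art GUI agent models and run experiments on standard grounding and navigation benchmarks.
Reference score (attention score computed independently by this site): 25.858928918473268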
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AQuaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.
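The single-screenshot idea in the abstract (build an adaptive quadtree, keep one merged token per leaf, and retain each token's spatial position) can be illustrated with a minimal sketch. This is not the authors' implementation: the variance-based split test, the threshold value, the mean-pooling merge, and the names `quadtree_leaves` and `reduce_tokens` are all assumptions made for illustration, and the conditional (multi-screenshot) quadtree is not covered.

```python
import numpy as np

def quadtree_leaves(feats, r0, c0, h, w, thresh):
    """Recursively split a block of the patch grid until it is homogeneous.

    feats: (H, W, D) array of per-patch features.
    Returns a list of leaves as (row, col, height, width) in patch units.
    """
    block = feats[r0:r0 + h, c0:c0 + w]
    # Assumed homogeneity test: largest per-channel variance inside the block.
    max_var = block.reshape(-1, block.shape[-1]).var(axis=0).max()
    if max_var <= thresh or (h == 1 and w == 1):
        return [(r0, c0, h, w)]
    h2, w2 = max(h // 2, 1), max(w // 2, 1)
    leaves = []
    for rr, hh in ((r0, h2), (r0 + h2, h - h2)):
        for cc, ww in ((c0, w2), (c0 + w2, w - w2)):
            if hh > 0 and ww > 0:  # skip empty quadrants of 1-wide blocks
                leaves += quadtree_leaves(feats, rr, cc, hh, ww, thresh)
    return leaves

def reduce_tokens(feats, thresh=1e-3):
    """Keep one mean-pooled token per leaf, plus its original grid position."""
    H, W, _ = feats.shape
    leaves = quadtree_leaves(feats, 0, 0, H, W, thresh)
    tokens = np.stack(
        [feats[r:r + h, c:c + w].mean(axis=(0, 1)) for r, c, h, w in leaves]
    )
    # The top-left patch index of each leaf is kept so that position
    # encodings can still refer to the original grid coordinates.
    positions = np.array([(r, c) for r, c, _, _ in leaves])
    return tokens, positions, leaves

# Toy 8x8 patch grid: a flat background with a small detailed 2x2 corner.
feats = np.zeros((8, 8, 4))
feats[0, 0], feats[0, 1], feats[1, 0], feats[1, 1] = 1.0, 2.0, 3.0, 4.0
tokens, positions, leaves = reduce_tokens(feats)
print(f"{len(tokens)} tokens kept out of {8 * 8}")  # 10 tokens kept out of 64
```

In this toy grid the homogeneous background collapses into a few large leaves while the detailed corner stays at single-patch resolution, which is the kind of non-uniform information density the abstract describes for GUI screenshots.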


Related papers