Fugu-MT Paper Translation (Abstract): AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

Paper overview: AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

arxiv url: http://arxiv.org/abs/2512.00846v1
Date: Sun, 30 Nov 2025 11:32:54 GMT
Status: Translation complete
Last updated in system: 2025-12-02 19:46:34.451117
Title: AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent
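Title (reference translation): AFRAgent : A high-resolution GUI agent based on adaptive feature renormalization
Authors: Neeraj Anand, Rishabh Jain, Sohan Patnaik, Balaji Krishnamurthy, Mausoom Sarkar
Abstract summary: We introduce an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation. We propose an adaptive feature renormalization (token-level affine transformation) technique that effectively enriches low-resolution image embeddings. We evaluate AFRAgent on the Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
Reference score (independently computed attention measure): 21.148033135113927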
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There is a growing demand for mobile user interface (UI) automation, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thus optimizing automation workflows. Recent approaches leverage VLMs for this problem due to their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by utilizing human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge for task understanding. However, these models often have trouble accurately identifying widgets and determining actions due to limited spatial information in vision encoder features. Additionally, top-performing models are often large, requiring extensive training and resulting in inference delays. In this work, we introduce AFRAgent, an instruct-BLIP-based multimodal architecture that achieves superior performance in GUI automation while being less than one-fourth the size of its nearest competitor. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization-based (a token-level affine transformation) technique that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.
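Abstract (reference translation): Demand for mobile user interface (UI) automation is growing, driven by its broad applications across industries. With the advent of visual language models (VLMs), GUI automation has progressed from generating text-based instructions for humans to autonomously executing tasks, thereby optimizing automation workflows. Recent approaches use VLMs for their ability to 1) process on-screen content directly, 2) remain independent of device-specific APIs by using human actions (e.g., clicks, typing), and 3) apply real-world contextual knowledge to task understanding. However, these models often struggle to identify widgets accurately and to determine actions, because the vision encoder features carry limited spatial information. In addition, top-performing models are often large, requiring extensive training and causing inference delays. This paper introduces AFRAgent, an instruct-BLIP-based multimodal architecture. To enhance image embeddings in the large language model (LLM) pipeline, we propose an adaptive feature renormalization (token-level affine transformation) technique that effectively enriches low-resolution image embeddings and fuses high-resolution details. We evaluate AFRAgent on the Meta-GUI and AITW benchmarks, establishing a new state-of-the-art baseline for smartphone automation.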
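
Illustrative sketch: the abstract describes adaptive feature renormalization only as a token-level affine transformation that enriches low-resolution image embeddings with high-resolution details. It does not say how the scale and shift are computed. The Python sketch below is one possible reading, assuming each low-resolution token is normalized and then scaled and shifted by values predicted from a paired high-resolution token through a linear map. The function names (layer_norm, linear, renormalize_tokens), the one-to-one pairing of tokens, and the (1 + scale) form are all assumptions made for illustration, not the authors' implementation.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token vector to zero mean and (near) unit variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def linear(vec, weight, bias):
    """Matrix-vector product plus bias; `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def renormalize_tokens(low_res_tokens, high_res_tokens,
                       w_scale, b_scale, w_shift, b_shift):
    """Per-token affine transform of low-res tokens (hypothetical sketch).

    Assumption: token i of the low-res stream is paired with token i of
    the high-res stream, which predicts that token's scale and shift.
    """
    out = []
    for low, high in zip(low_res_tokens, high_res_tokens):
        scale = linear(high, w_scale, b_scale)  # one scale per channel
        shift = linear(high, w_shift, b_shift)  # one shift per channel
        normed = layer_norm(low)
        # (1 + scale) keeps the transform at identity when scale is zero.
        out.append([n * (1.0 + s) + t
                    for n, s, t in zip(normed, scale, shift)])
    return out


if __name__ == "__main__":
    low = [[1.0, 2.0, 3.0]]    # one low-res token with 3 channels
    high = [[0.5, -0.5]]       # its paired high-res token with 2 channels
    w = [[0.1, 0.2], [0.0, 0.0], [-0.1, 0.3]]
    b = [0.0, 0.0, 0.0]
    print(renormalize_tokens(low, high, w, b, w, b))
```

With all weights and biases set to zero, the function reduces to plain per-token normalization, which makes the contribution of the high-resolution stream easy to isolate when experimenting with the sketch.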


Related papers