Fugu-MT Paper Translation (Abstract): Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

Paper overview: Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control

arxiv url: http://arxiv.org/abs/2510.14388v1
Date: Thu, 16 Oct 2025 07:38:21 GMT
Status: translation complete
Last updated in system: 2025-10-17 21:15:14.762813
Title: Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control
Title (reference translation): Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control
Authors: Zhe Wu, Hongjin Lu, Junliang Xing, Changhao Zhang, Yin Zhu, Yuhao Yang, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Jun Wang, Yuanchun Shi
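Abstract summary: We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control. Hi-Agent features a high-level reasoning model and a jointly optimized low-level action model. Hi-Agent achieves a new state-of-the-art (SOTA) task success rate of 87.9% on the Android-in-the-Wild (AitW) benchmark.
Reference score (independently computed attention score): 72.43808515668947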
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.
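Abstract (reference translation): Building agents that autonomously operate mobile devices has attracted growing attention. Vision-Language Models (VLMs) show promise, but most existing approaches rely on direct state-to-action mappings that lack structured reasoning and planning and generalize poorly to novel tasks and unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function that uses execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue that Group Relative Policy Optimization (GRPO) encounters in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves an 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also shows competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.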
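
The abstract names Group Relative Policy Optimization (GRPO) and critic-free training but gives no formulas. As a rough illustration only, the sketch below shows the standard group-relative advantage used in GRPO-style methods: several candidates are sampled for the same input, each is scored, and every score is standardized against its own group, so no learned critic is needed. The reward rule here (1.0 if the low-level model executed the proposed subgoal successfully, 0.0 otherwise) and the names `subgoal_reward` and `group_relative_advantages` are assumptions made for this example. This is not the paper's foresight advantage function, whose definition the abstract does not give.

```python
import math
from typing import List


def subgoal_reward(low_level_succeeded: bool) -> float:
    """Hypothetical reward for one high-level subgoal.

    Assumption for illustration: the only signal is whether the low-level
    action model managed to execute the subgoal.
    """
    return 1.0 if low_level_succeeded else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against the mean and std of its own group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps avoids division by zero when every candidate gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four candidate subgoals sampled for the same screen and instruction;
# the booleans stand in for execution feedback from the low-level model.
feedback = [True, False, True, False]
rewards = [subgoal_reward(ok) for ok in feedback]
advantages = group_relative_advantages(rewards)

# Executed subgoals end up with positive advantages, failed ones negative.
for ok, adv in zip(feedback, advantages):
    print(f"executed={ok} advantage={adv:+.3f}")
```

Because each candidate is compared only with the others in its group, a group in which every subgoal succeeds (or every one fails) yields zero advantages and therefore no gradient signal.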
