Fugu-MT Paper Translation (Abstract): Generalization in Online Reinforcement Learning for Mobile Agents

Paper overview: Generalization in Online Reinforcement Learning for Mobile Agents

arxiv url: http://arxiv.org/abs/2603.07432v1
Date: Sun, 08 Mar 2026 03:08:11 GMT
Status: Translation complete
System update date: 2026-03-10 15:13:14.586306
Title: Generalization in Online Reinforcement Learning for Mobile Agents
Authors: Li Gu, Zihuan Jiang, Zhixiang Chi, Huan Liu, Ziqiang Wang, Yuanhao Yu, Glen Berseth, Yang Wang
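Abstract summary: We introduce AndroidWorld-Generalization, a benchmark for evaluating zero-shot generalization to unseen task instances, templates, and applications. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure.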
Reference score (independently computed attention score): 32.98335803990582
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language-model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld-Generalization, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of containerized infrastructure and asynchronous execution, to support reliable and efficient training. Experiments on AndroidWorld-Generalization show that RL enables a 7B-parameter VLM agent to surpass supervised fine-tuning baselines, yielding a 26.1% improvement on unseen instances but only limited gains on unseen templates (15.7%) and apps (8.3%), underscoring the challenges of generalization. As a preliminary step, we demonstrate that few-shot adaptation at test time improves performance on unseen apps, motivating future research in this direction. To support reproducibility and fair comparison, we open-source the full RL training system, including the environment, task suite, models, prompt configurations, and the underlying infrastructure (https://github.com/zihuanjiang/AndroidWorld-Generalization).
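The abstract names Group Relative Policy Optimization (GRPO) but does not spell out its update. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with GRPO: several rollouts of the same task are scored, and each reward is normalized against the mean and standard deviation of its own group. The function name, the binary success rewards, the group size of four, and the `eps` stabilizer are all assumptions made for this example and are not taken from the paper or its repository.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    `rewards` holds one scalar reward per rollout of the SAME task.
    `eps` (an assumed stabilizer) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts of one task, 1.0 = success, 0.0 = failure.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every rollout succeeds carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, while a group with identical rewards yields all-zero advantages. This is one reason the rollout collection needs to gather several attempts per task.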

Related papers