Fugu-MT Paper Translation (Abstract): Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control

Paper overview: Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control

arxiv url: http://arxiv.org/abs/2509.01720v1
Date: Mon, 01 Sep 2025 18:55:27 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.824985
Title: Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control
Title (reference translation): Succeed or Learn Slowly: Sample-Efficient Off-Policy Reinforcement Learning for Mobile App Control
Authors: Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
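Abstract summary: This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy reinforcement learning algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation by modifying an off-policy actor-critic approach. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions.
Reference score (independently computed attention level): 50.316067647636196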
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) using foundation models for policy approximations in multi-turn tasks remains challenging. We identify two main limitations related to sparse reward settings and policy gradient updates, based on which we formulate a key insight: updates from positive samples with high returns typically do not require policy regularisation, whereas updates from negative samples, reflecting undesirable behaviour, can harm model performance. This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy RL algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach, applying direct policy updates for positive samples and conservative, regularised updates for negative ones to prevent model degradation. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions, further improving sample efficiency. We evaluate SoLS on the AndroidWorld benchmark, where it significantly outperforms existing methods (at least 17% relative increase), including prompt-engineering and RL approaches, while requiring substantially fewer computational resources than GPT-4o-based methods with 5-60x faster inference.
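Abstract (reference translation): Reinforcement learning (RL) using foundation models for policy approximation in multi-turn tasks remains challenging. Updates from positive samples with high returns typically do not require policy regularisation, whereas updates from negative samples, which reflect undesirable behaviour, can harm model performance. This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy RL algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach, applying direct policy updates for positive samples and conservative, regularised updates for negative ones to prevent model degradation. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions and further improves sample efficiency. We evaluate SoLS on the AndroidWorld benchmark, where it significantly outperforms existing methods (at least a 17% relative increase), including prompt-engineering and RL approaches, while requiring substantially fewer computational resources than GPT-4o-based methods, with 5-60x faster inference.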
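
The asymmetric update described in the abstract can be illustrated with a rough sketch. This is only one possible reading of the abstract, not the paper's actual objective: the PPO-style clipped surrogate used as the "conservative, regularised" branch, the function name `asymmetric_policy_loss`, the `clip_eps` parameter, and the success-only buffer in `SuccessfulTransitionReplay` are all assumptions introduced here for illustration.

```python
import math
import random
from collections import deque


def asymmetric_policy_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample loss (hypothetical): direct update for positive samples,
    clipped update for negative ones."""
    if advantage > 0:
        # Positive sample: plain policy-gradient term, no regularisation.
        return -advantage * logp_new
    # Negative sample: PPO-style clipped surrogate, used here as an
    # assumed stand-in for the paper's regularised update.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    return -min(ratio * advantage, clipped * advantage)


class SuccessfulTransitionReplay:
    """Hypothetical buffer that keeps only transitions from successful episodes."""

    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def add_episode(self, transitions, success):
        # Assumption: failed episodes are simply not stored.
        if success:
            self.buffer.extend(transitions)

    def sample(self, k, rng=random):
        # Draw up to k stored transitions without replacement.
        k = min(k, len(self.buffer))
        return rng.sample(list(self.buffer), k)
```

In this sketch, once the new policy has already lowered a negative action's probability past the `1 - clip_eps` ratio, the loss becomes constant in the ratio, so that sample stops pushing the policy further; positive samples are never clipped.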


Related papers