VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
- URL: http://arxiv.org/abs/2502.18906v1
- Date: Wed, 26 Feb 2025 07:52:02 GMT
- Title: VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
- Authors: Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
- Abstract summary: We propose an environment-free RL framework that decouples value estimation from policy optimization. The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings.
- Score: 34.98047665907545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., Does this action advance the user's goal?). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, outperforming environment-free baselines significantly and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve comparable performance with online-trained methods.
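The two-stage recipe described in the abstract (pretrain a value model on offline trajectories, then let its frozen scores weight policy updates) can be sketched in a few lines. This is a minimal, hypothetical illustration under assumed interfaces: the `ValueEnvironmentModel` class, the Monte-Carlo value targets, the softmax advantage weighting, and the `policy.log_prob` call are assumptions made for the sketch, not the paper's released implementation.

```python
# Hypothetical sketch of the two-stage, environment-free training loop.
# Interfaces (encoder, policy.log_prob) and objectives are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ValueEnvironmentModel(nn.Module):
    """Scores a (screen, instruction, action) pair with a long-term utility estimate."""

    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder                      # e.g. a VLM backbone producing token features
        self.value_head = nn.Linear(hidden_dim, 1)  # scalar value per sample

    def forward(self, obs_tokens: torch.Tensor, action_tokens: torch.Tensor) -> torch.Tensor:
        features = self.encoder(torch.cat([obs_tokens, action_tokens], dim=1))
        return self.value_head(features.mean(dim=1)).squeeze(-1)


def pretrain_vem(vem, offline_loader, optimizer):
    """Stage 1: regress the VEM onto offline value targets (e.g. Monte-Carlo returns)."""
    vem.train()
    for obs, act, value_target in offline_loader:
        loss = F.mse_loss(vem(obs, act), value_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


def train_policy_with_frozen_vem(policy, vem, offline_loader, optimizer, beta: float = 1.0):
    """Stage 2: frozen VEM scores weight the policy's log-likelihood on offline actions."""
    vem.eval()
    for obs, act, _ in offline_loader:
        with torch.no_grad():
            weights = torch.softmax(beta * vem(obs, act), dim=0)  # higher value -> larger weight
        log_probs = policy.log_prob(act, obs)  # assumed policy interface
        loss = -(weights * log_probs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because the VEM is frozen in the second stage, the policy never queries a live Android environment; its supervision comes entirely from the pretrained value estimates, which is the sense in which training is environment-free.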
Related papers
- Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. We propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM).
arXiv Detail & Related papers (2025-02-24T08:11:33Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Transfer Learning for CSI-based Positioning with Multi-environment Meta-learning [1.1763850077553188]
Deep learning (DL) techniques for radio-based positioning of user equipment (UE) through channel state information (CSI) fingerprints have demonstrated significant potential.
This paper proposes a novel DL model structure consisting of two parts, where the first part aims at identifying features that are independent from any specific environment, while the second part combines those features in an environment specific way with the goal of positioning.
Our findings indicate that employing the MEML approach for initializing the weights of the DL model for a new, unseen environment significantly boosts the accuracy of UE positioning in the new target environment, as well as the reliability of its uncertainty estimation.
arXiv Detail & Related papers (2024-05-20T06:23:22Z) - Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data [26.004807291215258]
Language-conditioned robot manipulation aims to develop robots capable of understanding and executing complex tasks.
We propose a general-purpose, language-conditioned approach that combines base skill priors and imitation learning under unstructured data.
We assess our model's performance in both simulated and real-world environments using a zero-shot setting.
arXiv Detail & Related papers (2023-05-30T14:40:38Z) - RE-MOVE: An Adaptive Policy Design for Robotic Navigation Tasks in Dynamic Environments via Language-Based Feedback [56.219221064727016]
Reinforcement learning-based policies for continuous control robotic navigation tasks often fail to adapt to changes in the environment during real-time deployment.
We propose a novel approach called RE-MOVE to adapt already trained policy to real-time changes in the environment without re-training via utilizing a language-based feedback.
arXiv Detail & Related papers (2023-03-14T04:20:59Z) - Rethinking Value Function Learning for Generalization in Reinforcement Learning [11.516147824168732]
We focus on the problem of training RL agents on multiple training environments to improve observational generalization performance.
We identify that the value network in the multiple-environment setting is more challenging to optimize and more prone to overfitting the training data than in the conventional single-environment setting.
We propose Delayed-Critic Policy Gradient (DCPG), which implicitly penalizes the value estimates by optimizing the value network less frequently with more training data than the policy network.
arXiv Detail & Related papers (2022-10-18T16:17:47Z) - Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z) - Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.