VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
- URL: http://arxiv.org/abs/2502.18906v1
- Date: Wed, 26 Feb 2025 07:52:02 GMT
- Title: VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
- Authors: Jiani Zheng, Lu Wang, Fangkai Yang, Chaoyun Zhang, Lingrui Mei, Wenjie Yin, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang,
- Abstract summary: We propose an environment-free RL framework that decouples value estimation from policy optimization.<n>The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals.<n> evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings.
- Score: 34.98047665907545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training Vision-Language Models (VLMs) for Graphical User Interfaces (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., Does this action advance the user's goal?). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, outperforming environment-free baselines significantly and matching environment-based approaches without interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve comparable performance with online-trained methods.
Related papers
- StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving [7.525510086747996]
Personalization has been largely overlooked in the context of end-to-end autonomous driving (E2EAD)<n>We introduce the first large-scale real-world dataset explicitly curated for personalized E2EAD.<n>We introduce the first standardized benchmark for systematically evaluating personalized E2EAD models.
arXiv Detail & Related papers (2025-06-30T15:48:38Z) - Adapting Vision-Language Models for Evaluating World Models [24.813041196394582]
We present UNIVERSE, a method for adapting Vision-language Evaluator for Rollouts in Simulated Environments under data and compute constraints.<n>We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions.<n>The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint.
arXiv Detail & Related papers (2025-06-22T09:53:28Z) - Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to the actual execution.<n>We develop a reasoning-bootstrapping based data collection pipeline to create a GUI-Critic-Train and a GUI-Critic-Test.<n>Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z) - MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents.<n>MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios.<n>Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z) - Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences.<n>It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance.<n>We propose textbfDecoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained emphglobal value model (GVM)
arXiv Detail & Related papers (2025-02-24T08:11:33Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Transfer Learning for CSI-based Positioning with Multi-environment Meta-learning [1.1763850077553188]
deep learning (DL) techniques for radio-based positioning of user equipment (UE) through channel state information (CSI) fingerprints have demonstrated significant potential.
This paper proposes a novel DL model structure consisting of two parts, where the first part aims at identifying features that are independent from any specific environment, while the second part combines those features in an environment specific way with the goal of positioning.
Our findings indicate that employing the MEML approach for initializing the weights of the DL model for a new unseen environment significantly boosts the accuracy of UE positioning in the new target environment as well the reliability of its uncertainty estimation.
arXiv Detail & Related papers (2024-05-20T06:23:22Z) - Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data [26.004807291215258]
Language-conditioned robot manipulation aims to develop robots capable of understanding and executing complex tasks.
We propose a general-purpose, language-conditioned approach that combines base skill priors and imitation learning under unstructured data.
We assess our model's performance in both simulated and real-world environments using a zero-shot setting.
arXiv Detail & Related papers (2023-05-30T14:40:38Z) - RE-MOVE: An Adaptive Policy Design for Robotic Navigation Tasks in
Dynamic Environments via Language-Based Feedback [56.219221064727016]
Reinforcement learning-based policies for continuous control robotic navigation tasks often fail to adapt to changes in the environment during real-time deployment.
We propose a novel approach called RE-MOVE to adapt already trained policy to real-time changes in the environment without re-training via utilizing a language-based feedback.
arXiv Detail & Related papers (2023-03-14T04:20:59Z) - Rethinking Value Function Learning for Generalization in Reinforcement
Learning [11.516147824168732]
We focus on the problem of training RL agents on multiple training environments to improve observational generalization performance.
We identify that the value network in the multiple-environment setting is more challenging to optimize and prone to overfitting training data than in the conventional single-environment setting.
We propose Delayed-Critic Policy Gradient (DCPG), which implicitly penalizes the value estimates by optimizing the value network less frequently with more training data than the policy network.
arXiv Detail & Related papers (2022-10-18T16:17:47Z) - Visual-Language Navigation Pretraining via Prompt-based Environmental
Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z) - Learning to Continuously Optimize Wireless Resource in a Dynamic
Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment.
We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes.
Our design is based on a novel bilevel optimization formulation which ensures certain fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.