ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World
- URL: http://arxiv.org/abs/2505.19095v1
- Date: Sun, 25 May 2025 11:13:03 GMT
- Title: ScreenExplorer: Training a Vision-Language Model for Diverse Exploration in Open GUI World
- Authors: Runliang Niu, Jinglong Ji, Yi Chang, Qi Wang,
- Abstract summary: Existing GUI agents based on vision-language models (VLMs) often fail to generalize to novel environments.<n>We introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments.<n>We also introduce a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration.
- Score: 11.401732438387704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization(GRPO) in real, dynamic, and open-ended GUI environments. Innovatively, we introduced a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration. Additionally, distilling experience streams further enhances the model's exploration capabilities. Our training framework enhances model exploration in open GUI environments, with trained models showing better environmental adaptation and sustained exploration compared to static deployment models. Our findings offer a scalable pathway toward AGI systems with self-improving capabilities in complex interactive settings.
Related papers
- GUI-ReRank: Enhancing GUI Retrieval with Multi-Modal LLM-based Reranking [55.762798168494726]
GUI-ReRank is a novel framework that integrates rapid embedding-based constrained retrieval models with highly effective MLLM-based reranking techniques.<n>We evaluated our approach on an established NL-based GUI retrieval benchmark.
arXiv Detail & Related papers (2025-08-05T10:17:38Z) - AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [82.42421823672954]
AgentCPM-GUI is built for robust and efficient on-device GUI interaction.<n>Our training pipeline includes grounding-aware pre-training to enhance perception.<n>AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks.
arXiv Detail & Related papers (2025-06-02T07:30:29Z) - ARPO:End-to-End Policy Optimization for GUI Agents with Experience Replay [88.74638385288773]
Agentic Replay Policy Optimization improves performance on complex, long-horizon computer tasks.<n>We propose a task selection strategy that filters tasks based on baseline agent performance.<n>Experiments on the OSWorld benchmark demonstrate that ARPO achieves competitive results.
arXiv Detail & Related papers (2025-05-22T06:24:32Z) - A Survey on GUI Agents with Foundation Models Enhanced by Reinforcement Learning [13.091740188171915]
We first formalize GUI agent tasks as Markov Decision Processes and discuss typical execution environments and evaluation metrics.<n>We then review the modular architecture of (M)LLM-based GUI agents, covering Perception, Planning, and Acting modules, and trace their evolution through representative works.<n>Our summary illustrates how recent innovations in multimodal perception, decision reasoning, and adaptive action generation have significantly improved the generalization and robustness of GUI agents in complex real-world environments.
arXiv Detail & Related papers (2025-04-29T06:55:15Z) - Exploration-Driven Generative Interactive Environments [53.05314852577144]
We focus on using many virtual environments for inexpensive, automatically collected interaction data.<n>We propose a training framework merely using a random agent in virtual environments.<n>Our agent is fully independent of environment-specific rewards and thus adapts easily to new environments.
arXiv Detail & Related papers (2025-04-03T12:01:41Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents.<n>It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue.<n>It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - Large Language Model-Brained GUI Agents: A Survey [42.82362907348966]
multimodal models have ushered in a new era of GUI automation.<n>They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing.<n>These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands.
arXiv Detail & Related papers (2024-11-27T12:13:39Z) - OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.37173845836839]
OS-Atlas is a foundational GUI action model that excels at GUI grounding and OOD agentic tasks.
We are releasing the largest open-source cross-platform GUI grounding corpus to date, which contains over 13 million GUI elements.
arXiv Detail & Related papers (2024-10-30T17:10:19Z) - EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data [15.801018643716437]
This paper aims to enhance the GUI understanding and interacting capabilities of large vision-language models (LVLMs) through a data-driven approach.
We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web.
Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work.
arXiv Detail & Related papers (2024-10-25T10:46:17Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.<n>Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.<n>We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM [0.0]
This paper presents V-Zen, an innovative Multimodal Large Language Model (MLLM) meticulously crafted to revolutionise the domain of GUI understanding and grounding.
V-Zen establishes new benchmarks in efficient grounding and next-action prediction, thereby laying the groundwork for self-operating computer systems.
The successful integration of V-Zen and GUIDE marks the dawn of a new era in multimodal AI research, opening the door to intelligent, autonomous computing experiences.
arXiv Detail & Related papers (2024-05-24T08:21:45Z) - CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments.
We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP)
With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.