Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent
- URL: http://arxiv.org/abs/2509.17917v1
- Date: Mon, 22 Sep 2025 15:40:31 GMT
- Title: Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent
- Authors: Junyu Lu, Songxin Zhang, Zejian Xie, Zhuoyang Song, Jiaxing Zhang,
- Abstract summary: Orcust is a framework that integrates Principle-Constrained Reward Modeling and Online VM-Grounded Trajectory Construction. OVTC spins up instrumented virtual machines to autonomously collect structured GUI interaction trajectories.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in GUI agents have achieved remarkable grounding and action-prediction performance, yet existing models struggle with unreliable reward signals and limited online trajectory generation. In this paper, we introduce Orcust, a framework that integrates Principle-Constrained Reward Modeling (PCRM) and Online VM-Grounded Trajectory Construction (OVTC) to enhance reasoning reliability and data efficiency in interactive GUI tasks. We leverage environment-verifiable and LLM-derived principles to enforce interpretable reward signals that constrain long chain-of-thought reasoning and rule-based feedback. OVTC spins up instrumented virtual machines to autonomously collect structured GUI interaction trajectories with explicit procedural and structural objectives, enabling the training of a stepwise reward model that robustly captures human preferences and adheres to task-specific constraints. Extensive experiments on standard GUI benchmarks covering perceptual grounding, foundational operations, and end-to-end task execution reveal that Orcust achieves state-of-the-art performance, improving by 22.2% on ScreenSpot and 23.9% on ScreenSpot-Pro over the base model (i.e., Qwen2.5-VL-7B). The results demonstrate Orcust's effectiveness in enhancing the reasoning, adaptability, and scalability of GUI agents across various environments and task complexities.
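The abstract's central idea, a stepwise reward that combines environment-verifiable rules with an LLM-derived principle score, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `Step` schema, field names, and the weighting scheme are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One GUI interaction step (hypothetical schema)."""
    action: str          # e.g. "click", "type"
    rule_ok: bool        # environment-verifiable check passed
    model_score: float   # LLM-derived principle score in [0, 1]

def stepwise_reward(steps: list[Step], rule_weight: float = 0.5) -> float:
    """Blend rule-based and model-based signals per step, then average.

    A step that violates an environment-verifiable rule is clamped to 0,
    so rule feedback acts as a hard constraint on the learned reward.
    """
    if not steps:
        return 0.0
    per_step = [
        rule_weight + (1 - rule_weight) * s.model_score if s.rule_ok else 0.0
        for s in steps
    ]
    return sum(per_step) / len(per_step)

traj = [Step("click", True, 0.9), Step("type", True, 0.7), Step("click", False, 0.8)]
print(round(stepwise_reward(traj), 3))  # → 0.6
```

Averaging over steps (rather than scoring only the final outcome) is what makes the signal "stepwise": a single failed rule check lowers the trajectory reward without zeroing out credit for the correct earlier steps.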
Related papers
- CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning [67.78566256784404]
While Supervised Fine-Tuning (SFT) facilitates fast adaptation, it often triggers knowledge overwriting. Reinforcement Learning (RL) demonstrates an inherent resilience that shields prior interaction logic from erasure. We propose a Continual GUI Learning framework that balances efficiency and skill retention.
arXiv Detail & Related papers (2026-03-03T13:02:20Z) - GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL [64.8155693023222]
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from a shortage of high-quality, action-aligned reasoning data. We present GUI-Libra, a tailored training recipe that addresses these challenges.
arXiv Detail & Related papers (2026-02-25T18:34:57Z) - POINTS-GUI-G: GUI-Grounding Journey [22.35782799756431]
We introduce POINTS-GUIG-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpotPro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UIVision. Our model's success is driven by three key factors: (1) Refined Data Engineering; (2) Improved Training Strategies; and (3) Reinforcement Learning with Verifiable Rewards.
arXiv Detail & Related papers (2026-02-06T05:14:11Z) - MagicGUI-RMS: A Multi-Agent Reward Model System for Self-Evolving GUI Agents via Automated Feedback Reflux [37.49192877577783]
We present MagicGUI-RMS, a multi-agent reward model system that delivers adaptive trajectory evaluation, corrective feedback, and self-evolving learning capabilities. To support reward learning at scale, we design a structured data construction pipeline that automatically produces balanced and diverse reward datasets. Experiments demonstrate that MagicGUI-RMS yields substantial gains in task accuracy and behavioral robustness.
arXiv Detail & Related papers (2026-01-19T13:50:43Z) - UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z) - CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks [11.121687042616974]
Reinforcement Learning (RL) can effectively enhance agents' performance in dynamic interactive GUI environments. However, most approaches collapse task-specific nuances into a single, coarse reward, leaving the agent with a uniform signal that yields inefficient policy updates. We propose CRAFT-GUI, a curriculum learning framework based on Group Relative Policy Optimization (GRPO) that explicitly accounts for the varying difficulty across trajectories.
arXiv Detail & Related papers (2025-08-15T09:55:02Z) - UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding [16.939058522414836]
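Several of these papers (CRAFT-GUI, MobileGUI-RL) build on GRPO, whose core mechanic, normalizing each sampled trajectory's reward against its own group rather than a learned value function, can be sketched in a few lines. This is a generic illustration of the group-relative advantage computation, not code from any of the papers above.

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages as used in GRPO.

    Each trajectory's reward is standardized against the mean and std of
    the group of rollouts sampled for the same prompt, so no separate
    critic/value network is needed.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one task: two succeeded (reward 1), two failed (reward 0).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advs])  # → [1.0, -1.0, 1.0, -1.0]
```

The advantages sum to zero by construction, so successful rollouts are reinforced exactly as much as failed ones are penalized within each group.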
We introduce UI-AGILE for enhancing GUI agents at both training and inference. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process. For inference, we present decomposed grounding with selection to dramatically improve grounding accuracy on high-resolution displays.
arXiv Detail & Related papers (2025-07-29T17:22:07Z) - MobileGUI-RL: Advancing Mobile GUI Agent through Reinforcement Learning in Online Environment [63.62778707277929]
MobileGUI-RL is a scalable framework that trains GUI agents in an online environment. It synthesizes a curriculum of learnable tasks through self-exploration and filtering. It adapts GRPO to GUI navigation with trajectory-aware advantages and composite rewards.
arXiv Detail & Related papers (2025-07-08T07:07:53Z) - Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error Diagnosis in GUI Automation [83.92224427735859]
We introduce a pre-operative critic mechanism that provides effective feedback prior to actual execution. We develop a reasoning-bootstrapping-based data collection pipeline to create the GUI-Critic-Train and GUI-Critic-Test datasets. Our model offers significant advantages in critic accuracy compared to current MLLMs.
arXiv Detail & Related papers (2025-06-05T04:12:36Z) - UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents [37.871793585090586]
We introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verifying trajectory outcomes is difficult, and high-quality training data are not scalable. We show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks.
arXiv Detail & Related papers (2025-05-27T17:58:06Z) - Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation [101.09478572153239]
We propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments.
arXiv Detail & Related papers (2025-04-22T17:52:42Z) - InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners [41.22438639369124]
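The inference-time guidance described above amounts to reranking candidate actions by a process reward model rather than updating any weights. A minimal sketch, where `reward_fn` is a hypothetical stand-in for the reward model (which in the paper's setting would score the screenshot, history, and candidate action at each step):

```python
from typing import Callable

def guided_action(candidates: list[str],
                  reward_fn: Callable[[str], float]) -> str:
    """Pick the candidate action the process reward model scores highest.

    The policy (VLM agent) proposes several actions; the reward model acts
    as a verifier, so no policy fine-tuning is required at inference time.
    """
    return max(candidates, key=reward_fn)

# Toy stand-in reward: prefer actions that target the intended widget.
score = lambda a: 1.0 if "submit" in a else 0.0
print(guided_action(["click(menu)", "click(submit)"], score))  # → click(submit)
```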
InfiGUI-R1 is an MLLM-based GUI agent developed through our Actor2Reasoner framework. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs. We refine the basic reasoner into a deliberative one using Reinforcement Learning.
arXiv Detail & Related papers (2025-04-19T09:25:55Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.