UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
- URL: http://arxiv.org/abs/2507.22025v2
- Date: Wed, 30 Jul 2025 12:17:53 GMT
- Title: UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding
- Authors: Shuquan Lian, Yuhang Wu, Jia Ma, Zihan Song, Bingqi Chen, Xiawu Zheng, Hui Li
- Abstract summary: We introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning process. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays.
- Score: 14.42815380175443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of Multimodal Large Language Models (MLLMs) has driven significant advances in Graphical User Interface (GUI) agent capabilities. Nevertheless, existing GUI agent training and inference techniques still suffer from a dilemma in reasoning design, ineffective rewards, and visual noise. To address these issues, we introduce UI-AGILE, a comprehensive framework enhancing GUI agents at both the training and inference stages. For training, we propose a suite of improvements to the Supervised Fine-Tuning (SFT) process: 1) a Continuous Reward function to incentivize high-precision grounding; 2) a "Simple Thinking" reward to balance planning with speed and grounding accuracy; and 3) a Cropping-based Resampling strategy to mitigate the sparse reward problem and improve learning on complex tasks. For inference, we present Decomposed Grounding with Selection, a novel method that dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts. Experiments show that UI-AGILE achieves state-of-the-art performance on two benchmarks, ScreenSpot-Pro and ScreenSpot-v2. For instance, using both our proposed training and inference enhancement methods brings a 23% grounding accuracy improvement over the best baseline on ScreenSpot-Pro.
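The paper itself does not include code in this abstract; the following is a minimal sketch, under stated assumptions, of the two ideas it names: a continuous grounding reward (dense credit for click predictions near the target center) and Decomposed Grounding with Selection (split a high-resolution screenshot into crops, ground in each, keep the best-scored candidate). The reward shape, the grid-tiling scheme, and the function names are illustrative, not the authors' implementation.

```python
import math
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Point = Tuple[float, float]

def continuous_grounding_reward(pred: Point, gt_box: Box) -> float:
    """Illustrative continuous reward: 1.0 at the target center, decaying
    with distance inside the element, 0.0 when the click misses it."""
    x1, y1, x2, y2 = gt_box
    px, py = pred
    if not (x1 <= px <= x2 and y1 <= py <= y2):
        return 0.0  # miss: no reward
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_diag = math.hypot(x2 - x1, y2 - y1) / 2  # max in-box distance
    return 1.0 - math.hypot(px - cx, py - cy) / half_diag

def decomposed_grounding(image_size: Tuple[int, int],
                         ground_in_crop: Callable[[Box], Tuple[Point, float]],
                         grid: Tuple[int, int] = (2, 2)) -> Point:
    """Illustrative 'decompose then select': split the screen into crops,
    ground the instruction in each crop, keep the highest-scoring candidate."""
    w, h = image_size
    rows, cols = grid
    best_point, best_score = (0.0, 0.0), float("-inf")
    for r in range(rows):
        for c in range(cols):
            crop: Box = (c * w / cols, r * h / rows,
                         (c + 1) * w / cols, (r + 1) * h / rows)
            (px, py), score = ground_in_crop(crop)  # one model call per crop
            if score > best_score:
                best_point, best_score = (px, py), score
    return best_point
```

`ground_in_crop` stands in for a grounding-model call that returns an absolute click point and a selection score for a crop; the paper's exact scoring and cropping strategy may differ from this 2x2 sketch.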
Related papers
- GTA1: GUI Test-time Scaling Agent [77.60727242084971]
This paper addresses two main challenges with GTA1, our GUI Test-time Scaling Agent. First, to select the most appropriate action proposal, we introduce a test-time scaling method. Second, we propose a model that achieves improved accuracy when grounding the selected action proposal to its corresponding visual elements.
arXiv Detail & Related papers (2025-07-08T08:52:18Z) - AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning [82.42421823672954]
AgentCPM-GUI is built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks.
arXiv Detail & Related papers (2025-06-02T07:30:29Z) - UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents [37.871793585090586]
We introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verifying trajectory outcomes is difficult, and high-quality training data are not scalable. We show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks.
arXiv Detail & Related papers (2025-05-27T17:58:06Z) - GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents [15.29032612749017]
Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding. We first conduct extensive analysis experiments on three key components of that training pipeline: input design, output evaluation, and policy update. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro.
arXiv Detail & Related papers (2025-05-21T17:59:09Z) - Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning [28.291759852111586]
We introduce a reinforcement learning (RL) based framework that incorporates three core strategies to improve GUI agent performance. With only 3k training samples, our 7B-parameter model achieves state-of-the-art results among similarly sized models. Notably, it attains 47.3% accuracy on the ScreenSpot-Pro dataset, outperforming much larger models, such as UI-TARS-72B, by a margin of 24.2%.
arXiv Detail & Related papers (2025-05-18T11:22:04Z) - Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation [101.09478572153239]
We propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments.
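A minimal sketch of what such inference-time guidance could look like, assuming the agent proposes several candidate actions per step and a scalar process reward model scores (observation, action) pairs; the greedy selection and all names here are illustrative, not the paper's actual algorithm.

```python
from typing import Callable, Sequence

def guided_step(candidate_actions: Sequence[str],
                process_reward: Callable[[str, str], float],
                observation: str) -> str:
    """Score each proposed action with a process reward model and
    return the highest-scoring one (greedy inference-time guidance)."""
    return max(candidate_actions,
               key=lambda action: process_reward(observation, action))
```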
arXiv Detail & Related papers (2025-04-22T17:52:42Z) - InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners [41.22438639369124]
InfiGUI-R1 is an MLLM-based GUI agent developed through our Actor2Reasoner framework. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs. We then refine the basic reasoner into a deliberative one using Reinforcement Learning.
arXiv Detail & Related papers (2025-04-19T09:25:55Z) - Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems [57.30711059396246]
Current Graphical User Interface (GUI) grounding systems locate interface elements based on natural language instructions. Inspired by human dual-system cognition, we present Focus, a novel GUI grounding framework that combines fast prediction with systematic analysis.
arXiv Detail & Related papers (2025-03-09T06:14:17Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z) - Improved GUI Grounding via Iterative Narrowing [0.03922370499388702]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provide the code to reproduce our results.
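The summary describes a narrowing loop; a minimal sketch of one plausible reading, assuming a grounding model that predicts a point within a given view and a fixed shrink factor per round (the loop structure, parameters, and names are assumptions for illustration, not the paper's implementation).

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Point = Tuple[float, float]

def iterative_narrowing(full_view: Box,
                        predict: Callable[[Box], Point],
                        iterations: int = 3,
                        shrink: float = 0.5) -> Point:
    """Illustrative narrowing loop: predict a point, re-crop a smaller
    window centred on it, and repeat to refine the localization."""
    view = full_view
    point = predict(view)
    for _ in range(iterations):
        x1, y1, x2, y2 = view
        w, h = (x2 - x1) * shrink, (y2 - y1) * shrink
        px, py = point
        # Centre the next, smaller view on the current prediction,
        # clamped so it stays inside the current view.
        nx1 = min(max(px - w / 2, x1), x2 - w)
        ny1 = min(max(py - h / 2, y1), y2 - h)
        view = (nx1, ny1, nx1 + w, ny1 + h)
        point = predict(view)
    return point
```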
arXiv Detail & Related papers (2024-11-18T05:47:12Z) - Activation to Saliency: Forming High-Quality Labels for Unsupervised Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues.
No human annotations are involved in our framework during the whole training process.
Our framework achieves significant performance gains compared with existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z)