UI-Venus Technical Report: Building High-performance UI Agents with RFT
- URL: http://arxiv.org/abs/2508.10833v2
- Date: Fri, 15 Aug 2025 14:49:07 GMT
- Title: UI-Venus Technical Report: Building High-performance UI Agents with RFT
- Authors: Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo, Liang Chen, Weiqiang Wang
- Abstract summary: We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. It achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples.
- Score: 43.28453678270454
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present UI-Venus, a native UI agent that takes only screenshots as input based on a multimodal large language model. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., ScreenSpot-V2 / Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summary and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation tasks and corresponding efficient data cleaning strategies. To further boost navigation performance, we propose Self-Evolving Trajectory History Alignment & Sparse Action Enhancement, which refines historical reasoning traces and balances the distribution of sparse but critical actions, leading to more coherent planning and better generalization in complex UI tasks. Our contributions include the release of SOTA open-source UI agents, comprehensive data cleaning protocols and a novel self-evolving framework for improving navigation performance, which encourage further research and development in the community. Code is available at https://github.com/inclusionAI/UI-Venus.
Related papers
- UI-Venus-1.5 Technical Report [64.4832043785725]
We present UI-Venus-1.5, a unified, end-to-end GUI Agent. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B). In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps.
arXiv Detail & Related papers (2026-02-09T18:43:40Z)
- OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution [32.992104943415995]
OmegaUse is a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms. It is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2. It also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.
arXiv Detail & Related papers (2026-01-28T08:45:17Z)
- FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection [81.25070759820589]
Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks. VLMs are tokenized into thousands of visual tokens, incurring significant computational overhead and diluting attention. We propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction.
arXiv Detail & Related papers (2026-01-07T13:48:12Z)
- MAI-UI Technical Report: Real-World Centric Foundation GUI Agents [33.46555542782679]
MAI-UI is a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, and the absence of a practical deployment architecture.
arXiv Detail & Related papers (2025-12-26T14:51:52Z)
- UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning [51.54456545661045]
We introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives. To achieve this, we propose a two-stage training framework: supervised fine-tuning and reinforcement learning. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks.
arXiv Detail & Related papers (2025-10-23T07:18:32Z)
- UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z)
- Toward Autonomous UI Exploration: The UIExplorer Benchmark [10.669221849705165]
We introduce UIExplore-Bench, the first benchmark explicitly dedicated to UI exploration. The benchmark evaluates agents with either Structured mode (granting access to layout information like DOM trees) or Screen mode (relying on GUI-only observations such as screenshots and human-like mouse/keyboard interactions) across three levels in a standardized GitLab sandbox environment. Our results show that UIExplore-AlGo achieves the leading mean hUFO scores, reaching up to 77.2% of human performance in Structured mode and 59.0% in Screen mode at 2,000 steps, particularly excelling at the Sparse level.
arXiv Detail & Related papers (2025-06-21T18:16:27Z)
- UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning [31.796328505473305]
We propose UI-R1, the first framework to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for GUI action prediction tasks. Specifically, UI-R1 introduces a novel rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). For efficient training, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices.
arXiv Detail & Related papers (2025-03-27T15:39:30Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in the digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- AutoGLM: Autonomous Foundation Agents for GUIs [51.276965515952]
We present AutoGLM, a new series in the ChatGLM family, designed to serve as foundation agents for autonomous control of digital devices through Graphical User Interfaces (GUIs).
We have developed AutoGLM as a practical foundation agent system for real-world GUI interactions.
Our evaluations demonstrate AutoGLM's effectiveness across multiple domains.
arXiv Detail & Related papers (2024-10-28T17:05:10Z)
- GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices [47.98821056800437]
We present GUIOdyssey, a dataset for cross-app mobile GUI navigation. GUIOdyssey comprises 8,334 episodes with an average of 15.3 steps per episode, covering 6 mobile devices, 212 distinct apps, and 1,357 app combinations. We develop OdysseyAgent, an exploratory multimodal agent for long-step cross-app navigation equipped with a history resampler module.
arXiv Detail & Related papers (2024-06-12T17:44:26Z)