MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
- URL: http://arxiv.org/abs/2512.22047v1
- Date: Fri, 26 Dec 2025 14:51:52 GMT
- Title: MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
- Authors: Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi,
- Abstract summary: MAI-UI is a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
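The abstract's device-cloud collaboration system "routes execution by task state." A minimal sketch of such a routing policy is below; all names (`TaskState`, its fields, the thresholds) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Route(Enum):
    DEVICE = auto()  # handle the next step with the small on-device model
    CLOUD = auto()   # escalate the next step to the large cloud model

@dataclass
class TaskState:
    """Hypothetical task state; field names are assumptions, not from the paper."""
    steps_taken: int          # actions executed so far in this episode
    device_confidence: float  # on-device model's confidence in its next action
    needs_user_input: bool    # task is blocked on agent-user interaction
    sensitive_screen: bool    # screen shows private data (keep on device)

def route(state: TaskState,
          conf_threshold: float = 0.7,
          max_device_steps: int = 20) -> Route:
    """Route execution by task state: prefer on-device execution for privacy
    and fewer cloud calls; escalate when the small model is uncertain or stuck."""
    if state.sensitive_screen or state.needs_user_input:
        return Route.DEVICE          # privacy-preserving: never upload
    if state.device_confidence < conf_threshold:
        return Route.CLOUD           # small model unsure -> defer to cloud
    if state.steps_taken > max_device_steps:
        return Route.CLOUD           # likely stuck in a loop on device
    return Route.DEVICE              # default: stay local, save cloud calls

# A confident, non-sensitive step stays on device.
print(route(TaskState(steps_taken=3, device_confidence=0.9,
                      needs_user_input=False, sensitive_screen=False)))
```

The privacy-first branch ordering mirrors the abstract's claims (fewer cloud calls, preserved privacy); the specific thresholds would in practice be tuned against on-device success rates.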
Related papers
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents [56.72789202127874]
The paper introduces GUI-Owl-1.5, the latest native GUI agent model. It supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. It achieves state-of-the-art results on more than 20 GUI benchmarks among open-source models.
arXiv Detail & Related papers (2026-02-15T01:52:19Z) - UI-Venus-1.5 Technical Report [64.4832043785725]
We present UI-Venus-1.5, a unified, end-to-end GUI agent. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B). In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps.
arXiv Detail & Related papers (2026-02-09T18:43:40Z) - UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning [51.54456545661045]
We introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives. To achieve this, we propose a two-stage training framework: supervised fine-tuning and reinforcement learning. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks.
arXiv Detail & Related papers (2025-10-23T07:18:32Z) - UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [155.51875080423883]
The development of autonomous agents for graphical user interfaces presents major challenges in artificial intelligence. We present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5.
arXiv Detail & Related papers (2025-09-02T17:44:45Z) - Mobile-Agent-v3: Fundamental Agents for GUI Automation [59.775510710011325]
This paper introduces a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models. We propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld.
arXiv Detail & Related papers (2025-08-21T00:39:12Z) - UI-Venus Technical Report: Building High-performance UI Agents with RFT [43.28453678270454]
We present UI-Venus, a native UI agent built on a multimodal large language model that takes only screenshots as input. It achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples.
arXiv Detail & Related papers (2025-08-14T16:58:07Z) - UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)