Related papers: Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

URL: http://arxiv.org/abs/2509.26539v1
Date: Tue, 30 Sep 2025 17:13:56 GMT
Title: Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
Authors: Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, Ram Ramrakhya, Chao Jia, Jeffrey Nichols, Alexander Toshev, Yinfei Yang, Zhe Gan
Abstract summary: Ferret-UI Lite is a compact, end-to-end GUI agent that operates across diverse platforms. Ferret-UI Lite achieves competitive performance with other small-scale GUI agents.
Score: 79.81903177553684
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Developing autonomous agents that effectively interact with Graphical User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Utilizing techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent by curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and applying reinforcement learning with designed rewards. Ferret-UI Lite achieves competitive performance with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of $91.6\%$, $53.3\%$, and $61.2\%$ on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of $28.0\%$ on AndroidWorld and $19.8\%$ on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.

Related papers

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents [56.72789202127874]
The paper introduces GUI-Owl-1.5, the latest native GUI agent model. It supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge collaboration and real-time interaction. It achieves state-of-the-art results among open-source models on more than 20 GUI benchmarks.
arXiv Detail & Related papers (2026-02-15T01:52:19Z)
Mobile-Agent-v3: Fundamental Agents for GUI Automation [59.775510710011325]
This paper introduces a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models. We propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld.
arXiv Detail & Related papers (2025-08-21T00:39:12Z)
ZeroGUI: Automating Online GUI Learning at Zero Human Cost [75.21128388931945]
We propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments.
arXiv Detail & Related papers (2025-05-29T17:59:51Z)
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms [48.00193601902457]
Ferret-UI 2 is a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms. Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting.
arXiv Detail & Related papers (2024-10-24T17:58:31Z)
GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z)
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs [44.636020540018194]
We present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens. Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
arXiv Detail & Related papers (2024-04-08T17:55:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
