Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
- URL: http://arxiv.org/abs/2410.18967v1
- Date: Thu, 24 Oct 2024 17:58:31 GMT
- Authors: Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan
- Abstract summary: Ferret-UI 2 is a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms.
Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting.
- Abstract: Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.
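Two of these innovations are concrete enough to sketch. Below is a minimal, hypothetical Python illustration of (1) adaptive scaling, resizing a screenshot to the closest of a set of supported resolutions, and (2) set-of-mark visual prompting, overlaying numbered boxes on UI elements so a model such as GPT-4o can refer to them by index. The resolution grid, box coordinates, and marking style are assumptions made for illustration, not Ferret-UI 2's actual pipeline.

```python
# Hypothetical sketch of adaptive scaling + set-of-mark prompting;
# the candidate resolutions and element boxes below are made up.
from PIL import Image, ImageDraw

CANDIDATE_RESOLUTIONS = [(448, 448), (448, 896), (896, 448), (896, 896)]  # assumed grid

def adaptive_scale(img: Image.Image) -> Image.Image:
    """Pick the candidate resolution whose aspect ratio best matches the input."""
    ar = img.width / img.height
    best = min(CANDIDATE_RESOLUTIONS, key=lambda wh: abs(wh[0] / wh[1] - ar))
    return img.resize(best)

def set_of_mark(img: Image.Image, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    """Overlay a numbered box on each UI element (boxes are x0, y0, x1, y1)."""
    marked = img.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
        draw.text((x0 + 2, y0 + 2), str(i), fill="red")
    return marked

screenshot = Image.new("RGB", (1170, 2532), "white")  # stand-in for an iPhone capture
marked = set_of_mark(adaptive_scale(screenshot), [(10, 20, 200, 80), (10, 100, 200, 160)])
marked.save("marked.png")
```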
Related papers
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that takes screenshots as its sole input and performs human-like interactions.
On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
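As a rough illustration of the screenshot-in, action-out loop such a native agent implies, here is a minimal Python sketch; the Action schema and the stubbed model call are invented placeholders, not UI-TARS's actual interface.

```python
# Toy observe-act loop for a screenshots-only GUI agent (illustrative only).
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll", "done"
    x: int = 0
    y: int = 0
    text: str = ""

def propose_action(screenshot_png: bytes, goal: str) -> Action:
    # Stub: a real agent would run the multimodal model here.
    return Action(kind="done")

def run_episode(capture, execute, goal: str, max_steps: int = 15):
    """Observe, act, repeat -- cf. the 15/50-step budgets quoted above."""
    for step in range(max_steps):
        action = propose_action(capture(), goal)
        if action.kind == "done":
            return step
        execute(action)
    return max_steps

# Toy usage with dummy capture/execute hooks:
print(run_episode(capture=lambda: b"", execute=lambda a: None, goal="open settings"))
```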
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.
Our approach leverages image-based observations and grounds natural-language instructions to visual elements.
To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
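As a hypothetical illustration of what grounding combined with explicit planning can look like as structured model output, the JSON schema below is invented for this sketch and is not Aguvis's actual format.

```python
# Invented plan-then-ground output schema, parsed and lightly validated.
import json

raw = """{
  "plan": ["open the settings menu", "toggle dark mode"],
  "reasoning": "the gear icon in the top-right opens settings",
  "action": {"kind": "click", "x": 1032, "y": 44}
}"""

step = json.loads(raw)
assert step["action"]["kind"] in {"click", "type", "scroll"}
print(step["plan"][0], "->", step["action"])
```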
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- MobileFlow: A Multimodal LLM For Mobile GUI Agent [4.7619361168442005]
This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents.
MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders.
It has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks.
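A toy sketch of the hybrid-encoder idea: two visual encoders (say, one for global layout and one for fine-grained text and icons) produce token streams that are projected to a shared width and concatenated before the LLM. The shapes and modules below are illustrative assumptions, not MobileFlow's architecture.

```python
# Illustrative dual-encoder fusion; dimensions are made up.
import torch
import torch.nn as nn

class HybridVisualEncoder(nn.Module):
    def __init__(self, dim_a=1024, dim_b=768, llm_dim=4096):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, llm_dim)  # projects encoder A's tokens
        self.proj_b = nn.Linear(dim_b, llm_dim)  # projects encoder B's tokens

    def forward(self, tokens_a, tokens_b):
        # tokens_a: (B, Na, dim_a), tokens_b: (B, Nb, dim_b)
        return torch.cat([self.proj_a(tokens_a), self.proj_b(tokens_b)], dim=1)

enc = HybridVisualEncoder()
fused = enc(torch.randn(1, 256, 1024), torch.randn(1, 576, 768))
print(fused.shape)  # torch.Size([1, 832, 4096])
```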
arXiv Detail & Related papers (2024-07-05T08:37:10Z)
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models [119.63480600733715]
We unveil Ferret-v2, a significant upgrade to Ferret, with three key designs.
A flexible approach effortlessly handles higher image resolutions, improving the model's ability to process and understand images in greater detail.
By integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information.
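One common recipe for handling higher resolutions is to keep a low-res global view and split the full-res image into fixed-size tiles that are encoded separately; the sketch below illustrates that general pattern under an assumed tile size, not Ferret-v2's exact method.

```python
# Global-view + tiling sketch for high-resolution inputs (assumed tile size).
from PIL import Image

TILE = 336  # hypothetical encoder input size

def global_plus_tiles(img: Image.Image):
    glob = img.resize((TILE, TILE))                 # coarse global context
    cols = -(-img.width // TILE)                    # ceil division
    rows = -(-img.height // TILE)
    padded = Image.new("RGB", (cols * TILE, rows * TILE))
    padded.paste(img, (0, 0))
    tiles = [padded.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
             for r in range(rows) for c in range(cols)]
    return glob, tiles

g, tiles = global_plus_tiles(Image.new("RGB", (1280, 800), "white"))
print(len(tiles))  # 4 cols x 3 rows = 12
```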
arXiv Detail & Related papers (2024-04-11T17:56:05Z)
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs [44.636020540018194]
We present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens.
Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions.
Ferret-UI not only excels beyond most open-source UI MLLMs, but also surpasses GPT-4V on all elementary UI tasks.
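For concreteness, the two elementary task families, referring (region in, text out) and grounding (text in, region out), can be illustrated with toy examples; the prompt wording and coordinate convention below are invented, not Ferret-UI's actual training format.

```python
# Invented examples of the referring vs. grounding task formats.
referring_example = {
    "task": "referring",
    "prompt": "What is the widget at [120, 340, 480, 400]?",  # box -> text
    "answer": "A 'Sign in' button.",
}
grounding_example = {
    "task": "grounding",
    "prompt": "Where is the 'Sign in' button?",               # text -> box
    "answer": "[120, 340, 480, 400]",
}
print(referring_example["task"], "/", grounding_example["task"])
```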
arXiv Detail & Related papers (2024-04-08T17:55:44Z)
- Game of Privacy: Towards Better Federated Platform Collaboration under Privacy Restriction [95.12382372267724]
Vertical federated learning (VFL) aims to train models from cross-silo data with different feature spaces stored on different platforms.
Due to the intrinsic privacy risks of federated learning, the total amount of involved data may be constrained.
We propose to incentivize different platforms through reciprocal collaboration, where all platforms can exploit multi-platform information in the VFL framework to benefit their own tasks.
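A toy sketch of the vertical setup: two platforms hold different feature columns for the same users, keep their raw features local, and exchange only intermediate embeddings that a joint head consumes. Model sizes and the fusion rule are illustrative assumptions, not the paper's protocol.

```python
# Minimal vertical-federated sketch: per-party encoders over split features.
import torch
import torch.nn as nn

class PartyModel(nn.Module):
    """Each platform's private encoder over its own feature space."""
    def __init__(self, in_dim, out_dim=16):
        super().__init__()
        self.net = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return self.net(x)

party_a = PartyModel(in_dim=10)   # e.g. an e-commerce platform's features
party_b = PartyModel(in_dim=6)    # e.g. a news platform's features
head = nn.Linear(32, 1)           # joint task head over concatenated embeddings

xa, xb = torch.randn(4, 10), torch.randn(4, 6)   # same 4 users, split features
logits = head(torch.cat([party_a(xa), party_b(xb)], dim=1))
print(logits.shape)  # torch.Size([4, 1])
```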
arXiv Detail & Related papers (2022-02-10T16:45:40Z)