Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
- URL: http://arxiv.org/abs/2410.18967v1
- Date: Thu, 24 Oct 2024 17:58:31 GMT
- Title: Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
- Authors: Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, Zhe Gan
- Abstract summary: Ferret-UI 2 is a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms.
Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting.
- Score: 48.00193601902457
- Abstract: Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.
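The abstract names the key techniques but not their mechanics. The sketch below illustrates, under stated assumptions, generic forms of two of them: grid-based adaptive scaling (tiling a screenshot into sub-images whose grid best matches its aspect ratio, plus a low-resolution global view) and set-of-mark visual prompting (overlaying numbered boxes so a teacher model such as GPT-4o can refer to widgets by mark ID when generating task training data). The function names, tile size, and tile budget are illustrative choices, not Ferret-UI 2's actual implementation.

```python
# Minimal sketch of two ideas named in the abstract; illustrative only,
# not Ferret-UI 2's released code.
from PIL import Image, ImageDraw


def choose_grid(width, height, max_tiles=6):
    """Pick a (cols, rows) tiling, at most max_tiles tiles, whose aspect
    ratio is closest to the screenshot's aspect ratio."""
    target = width / height
    best, best_diff = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles + 1):
            if cols * rows > max_tiles:
                continue
            diff = abs(cols / rows - target)
            if diff < best_diff:
                best, best_diff = (cols, rows), diff
    return best


def adaptive_scale(img: Image.Image, tile=336, max_tiles=6):
    """Return a low-resolution global view plus high-resolution tiles,
    so small UI widgets survive downscaling on any platform."""
    cols, rows = choose_grid(*img.size, max_tiles=max_tiles)
    scaled = img.resize((cols * tile, rows * tile))
    tiles = [
        scaled.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows)
        for c in range(cols)
    ]
    return img.resize((tile, tile)), tiles


def draw_set_of_marks(img: Image.Image, boxes):
    """Overlay numbered boxes (e.g., from an accessibility tree) so a
    teacher model like GPT-4o can refer to widgets by mark ID when
    generating question-answer training data."""
    marked = img.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(i), fill="red")
    return marked
```

Under this scheme a portrait iPhone screenshot would receive a tall grid and a landscape AppleTV frame a wide one, keeping small text legible to the vision encoder while the global view preserves layout context.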
Related papers
- MobileFlow: A Multimodal LLM For Mobile GUI Agent [4.7619361168442005]
This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents.
MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders.
It has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks.
arXiv Detail & Related papers (2024-07-05T08:37:10Z)
- LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs).
LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface.
In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z)
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models [119.63480600733715]
We unveil Ferret-v2, a significant upgrade to Ferret, with three key designs.
A flexible approach effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail.
By integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information (see the sketch after this entry).
arXiv Detail & Related papers (2024-04-11T17:56:05Z)
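The Ferret-v2 entry above mentions an additional DINOv2 encoder alongside the main vision encoder. Below is a minimal, hypothetical sketch of one way to fuse patch features from two encoders by concatenation and projection; the class name, the fusion strategy, and the assumption that both encoders emit aligned (batch, tokens, dim) features are illustrative, not Ferret-v2's actual architecture.

```python
# Hypothetical fusion of two vision encoders (e.g., a CLIP-style encoder
# plus DINOv2); illustrative only, not Ferret-v2's released architecture.
import torch
import torch.nn as nn


class DualEncoderFusion(nn.Module):
    """Concatenate patch features from two encoders and project them to
    the token width expected by the language model."""

    def __init__(self, clip_encoder: nn.Module, dino_encoder: nn.Module,
                 clip_dim: int, dino_dim: int, out_dim: int):
        super().__init__()
        self.clip_encoder = clip_encoder  # global / semantic features
        self.dino_encoder = dino_encoder  # fine-grained / local features
        self.proj = nn.Linear(clip_dim + dino_dim, out_dim)

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # Both encoders are assumed to return (batch, num_patches, dim)
        # features over the same patch grid.
        clip_feats = self.clip_encoder(pixels)
        dino_feats = self.dino_encoder(pixels)
        return self.proj(torch.cat([clip_feats, dino_feats], dim=-1))
```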
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs [44.636020540018194]
We present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens.
Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions.
Ferret-UI not only outperforms most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
arXiv Detail & Related papers (2024-04-08T17:55:44Z)
- Ferret: Refer and Ground Anything Anywhere at Any Granularity [93.80461625100826]
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image.
Ferret employs a novel and powerful hybrid region representation that jointly integrates discrete coordinates and continuous features to represent a region in the image (see the sketch after this entry).
Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes.
arXiv Detail & Related papers (2023-10-11T17:55:15Z)
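The Ferret entry above describes a hybrid region representation that mixes discrete coordinates with continuous features. The sketch below shows the general idea for axis-aligned boxes only: quantize corner coordinates into bins (discrete tokens) and average-pool an image feature map over the region (continuous feature). The bin count, pooling choice, and function names are assumptions; Ferret itself handles free-form shapes with a more elaborate visual sampler.

```python
# Illustrative hybrid region representation for an axis-aligned box:
# discrete coordinate tokens plus a continuous pooled feature.
# Bin count, pooling, and names are assumptions, not Ferret's exact scheme.
import torch


def coord_tokens(box, width, height, bins=1000):
    """Quantize a pixel-space box (x0, y0, x1, y1) into discrete bins that
    the language model can emit and consume as tokens."""
    x0, y0, x1, y1 = box
    return [
        int(x0 / width * (bins - 1)),
        int(y0 / height * (bins - 1)),
        int(x1 / width * (bins - 1)),
        int(y1 / height * (bins - 1)),
    ]


def region_feature(feature_map: torch.Tensor, box, width, height):
    """Average-pool a (C, H, W) image feature map over the box, yielding a
    continuous embedding that complements the discrete tokens."""
    c, fh, fw = feature_map.shape
    x0, y0, x1, y1 = box
    fx0 = int(x0 / width * fw)
    fy0 = int(y0 / height * fh)
    fx1 = max(int(x1 / width * fw), fx0 + 1)   # keep at least one cell
    fy1 = max(int(y1 / height * fh), fy0 + 1)
    return feature_map[:, fy0:fy1, fx0:fx1].mean(dim=(1, 2))
```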
- WrapperFL: A Model Agnostic Plug-in for Industrial Federated Learning [10.909577776094782]
This paper presents a simple yet practical federated learning plug-in inspired by ensemble learning, dubbed WrapperFL.
WrapperFL works in a plug-and-play way by simply attaching to the input and output interfaces of an existing model, without the need for re-development (see the sketch after this entry).
arXiv Detail & Related papers (2022-06-21T13:59:11Z)
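The WrapperFL entry above describes attaching a plug-in to the input and output interfaces of an existing model. The sketch below is a hypothetical rendering of that idea for flat feature inputs: the deployed model is frozen, its input and output are concatenated, and only a small shared head is trained and exchanged in the federated loop. All names are illustrative, not WrapperFL's actual API.

```python
# Hypothetical rendering of the plug-and-play idea for flat feature
# inputs; class and method names are illustrative, not WrapperFL's API.
import torch
import torch.nn as nn


class FederatedWrapper(nn.Module):
    """Wrap a deployed model without modifying it: freeze the model, feed
    its input and output into a small head, and train/exchange only the
    head in the federated loop."""

    def __init__(self, local_model: nn.Module, input_dim: int, num_classes: int):
        super().__init__()
        self.local_model = local_model
        for p in self.local_model.parameters():
            p.requires_grad = False  # existing model stays untouched
        self.shared_head = nn.Linear(input_dim + num_classes, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            local_out = self.local_model(x)        # output interface
        fused = torch.cat([x, local_out], dim=-1)  # input interface
        return self.shared_head(fused)

    def shared_state(self):
        """The only parameters a federated-averaging server would aggregate."""
        return self.shared_head.state_dict()
```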
- Game of Privacy: Towards Better Federated Platform Collaboration under Privacy Restriction [95.12382372267724]
Vertical federated learning (VFL) aims to train models from cross-silo data with different feature spaces stored on different platforms.
Due to the intrinsic privacy risks of federated learning, the total amount of involved data may be constrained.
We propose to incent different platforms through a reciprocal collaboration, where all platforms can exploit multi-platform information in the VFL framework to benefit their own tasks.
arXiv Detail & Related papers (2022-02-10T16:45:40Z)
- ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces [12.52699475631247]
We introduce a new pre-trained UI representation model called ActionBert.
Our methodology is designed to leverage visual, linguistic, and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components (see the sketch after this entry).
Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
arXiv Detail & Related papers (2020-12-22T20:49:52Z)
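The ActionBert entry above mentions combining visual, linguistic, and domain-specific features of UI components. Below is a minimal, hypothetical sketch of such a fusion for a single UI element, concatenating crop features, text features, an element-type embedding, and a normalized bounding box before projection; the dimensions and feature choices are assumptions, not ActionBert's actual inputs.

```python
# Hypothetical fusion of per-element features before transformer
# pre-training; dimensions and inputs are assumptions, not ActionBert's.
import torch
import torch.nn as nn


class UIElementEmbedding(nn.Module):
    """Combine visual, linguistic, and domain-specific signals for one UI
    element into a single token embedding."""

    def __init__(self, visual_dim=512, text_dim=768, num_types=32, out_dim=768):
        super().__init__()
        self.type_embed = nn.Embedding(num_types, 64)  # e.g., BUTTON, CHECKBOX
        self.proj = nn.Linear(visual_dim + text_dim + 64 + 4, out_dim)

    def forward(self, visual_feat, text_feat, element_type, bbox):
        # visual_feat:  features of the element's screenshot crop
        # text_feat:    encoded text / content description
        # element_type: integer id from the view hierarchy (domain-specific)
        # bbox:         normalized (x0, y0, x1, y1) position on screen
        fused = torch.cat(
            [visual_feat, text_feat, self.type_embed(element_type), bbox], dim=-1
        )
        return self.proj(fused)
```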
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.