MobileFlow: A Multimodal LLM For Mobile GUI Agent
- URL: http://arxiv.org/abs/2407.04346v3
- Date: Fri, 06 Dec 2024 07:43:34 GMT
- Title: MobileFlow: A Multimodal LLM For Mobile GUI Agent
- Authors: Songqin Nong, Jiali Zhu, Rui Wu, Jiongchao Jin, Shuo Shan, Xiutian Huang, Wenhao Xu,
- Abstract summary: This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents.
MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders.
It has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks.
- Score: 4.7619361168442005
- License:
- Abstract: Currently, the integration of mobile Graphical User Interfaces (GUIs) is ubiquitous in most people's daily lives. And the ongoing evolution of multimodal large-scale models, such as GPT-4v, Qwen-VL-Max, has significantly bolstered the capabilities of GUI comprehension and user action analysis, showcasing the potentiality of intelligent GUI assistants. However, current GUI Agents often need to access page layout information through calling system APIs, which may pose privacy risks. Fixing GUI (such as mobile interfaces) to a certain low resolution might result in the loss of fine-grained image details. At the same time, the multimodal large models built for GUI Agents currently have poor understanding and decision-making abilities for Chinese GUI interfaces, making them difficult to apply to a large number of Chinese apps. This paper introduces MobileFlow, a multimodal large language model meticulously crafted for mobile GUI agents. Transforming from the open-source model Qwen-VL-Chat into GUI domain, MobileFlow contains approximately 21 billion parameters and is equipped with novel hybrid visual encoders, making it possible for variable resolutions of image inputs and good support for multilingual GUI. By incorporating Mixture of Experts (MoE) expansions and pioneering alignment training strategies, MobileFlow has the capacity to fully interpret image data and comprehend user instructions for GUI interaction tasks. Finally, MobileFlow outperforms Qwen-VL-Max and GPT-4v in terms of task execution by GUI agents on both public and our proposed evaluation metrics, and has been successfully deployed in real-world business contexts, proving its effectiveness for practical applications.
Related papers
- InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection [38.833925781308665]
We introduce textitInfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline.
Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills.
textitInfiGUIAgent achieves competitive performance on several GUI benchmarks.
arXiv Detail & Related papers (2025-01-08T15:45:21Z) - Falcon-UI: Understanding GUI Before Following User Instructions [57.67308498231232]
We introduce an instruction-free GUI navigation dataset, termed Insight-UI dataset, to enhance model comprehension of GUI environments.
Insight-UI dataset is automatically generated from the Common Crawl corpus, simulating various platforms.
We develop the GUI agent model Falcon-UI, which is initially pretrained on Insight-UI dataset and subsequently fine-tuned on Android and Web GUI datasets.
arXiv Detail & Related papers (2024-12-12T15:29:36Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.
Our approach leverages image-based observations, and grounding instructions in natural language to visual elements.
To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - Ponder & Press: Advancing Visual GUI Agent towards General Computer Control [13.39115823642937]
Ponder & Press is a divide-and-conquer framework for general computer control using only visual input.
Our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications.
arXiv Detail & Related papers (2024-12-02T08:35:31Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels.
AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z) - GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents.
First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs.
Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z) - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z) - META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI [28.484013258445067]
We propose a new TOD architecture: GUI-based task-oriented dialogue system (GUI-TOD)
A GUI-TOD system can directly perform GUI operations on real APPs and execute tasks without invoking backend APIs.
We release META-GUI, a dataset for training a Multi-modal conversational agent on mobile GUI.
arXiv Detail & Related papers (2022-05-23T04:05:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.