MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
- URL: http://arxiv.org/abs/2409.14818v2
- Date: Thu, 3 Oct 2024 05:23:22 GMT
- Title: MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding
- Authors: Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang
- Abstract summary: We propose MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding.
To address the lack of mobile pre-training data, we built a large Chinese mobile dataset Mobile3M from scratch, which contains 3 million UI pages.
Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.
- Score: 37.15649883702765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically utilize a VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions, and lack inter-UI understanding. To address these issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We defined four UI-based pre-training tasks, enabling the model to better perceive fine-grained elements and capture page transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset Mobile3M from scratch, which contains 3 million UI pages and real-world transition actions, forming a directed graph structure. Experimental results show that MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.
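The abstract describes Mobile3M as a directed graph in which UI pages are nodes and real-world transition actions are edges. Below is a minimal illustrative sketch of how such a structure could be represented; the class and field names (UIPage, TransitionAction, and so on) are hypothetical and do not reflect the paper's actual data format.

```python
# Minimal sketch of a Mobile3M-style directed graph of UI pages and
# transition actions. Names and fields are hypothetical assumptions,
# not the dataset's real schema.
from dataclasses import dataclass, field


@dataclass
class UIElement:
    element_id: str
    element_type: str                 # e.g. "button", "text", "icon"
    text: str                         # visible label or content
    bbox: tuple[int, int, int, int]   # (x1, y1, x2, y2) on the screenshot


@dataclass
class UIPage:
    page_id: str
    screenshot_path: str
    elements: list[UIElement] = field(default_factory=list)


@dataclass
class TransitionAction:
    source_page: str    # page_id where the action is performed
    target_page: str    # page_id the action leads to
    action_type: str    # e.g. "click", "scroll", "input"
    element_id: str     # element acted upon


class MobileUIGraph:
    """Directed graph: nodes are UI pages, edges are transition actions."""

    def __init__(self) -> None:
        self.pages: dict[str, UIPage] = {}
        self.edges: dict[str, list[TransitionAction]] = {}

    def add_page(self, page: UIPage) -> None:
        self.pages[page.page_id] = page
        self.edges.setdefault(page.page_id, [])

    def add_transition(self, action: TransitionAction) -> None:
        self.edges.setdefault(action.source_page, []).append(action)

    def successors(self, page_id: str) -> list[str]:
        # Pages reachable from `page_id` in a single action.
        return [a.target_page for a in self.edges.get(page_id, [])]
```

In this sketch, successors() exposes the page-to-page transitions that the paper's inter-UI pre-training tasks are meant to capture; how MobileVLM actually samples such transitions is not specified here.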
Related papers
- Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark [45.28023118459497]
We introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence.
It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios.
A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set.
arXiv Detail & Related papers (2025-03-26T17:59:56Z)
- Does Chain-of-Thought Reasoning Help Mobile GUI Agent? An Empirical Study [4.18969040567543]
This paper presents the first empirical study on the effectiveness of reasoning-enabled vision-language models (VLMs) in mobile GUI agents.
We evaluate two pairs of commercial models, Gemini 2.0 Flash and Claude 3.7 Sonnet, comparing their base and reasoning-enhanced versions across two benchmarks.
Surprisingly, we find that the Claude 3.7 Sonnet reasoning model achieves state-of-the-art performance on AndroidWorld.
arXiv Detail & Related papers (2025-03-21T01:52:43Z)
- Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks [85.48034185086169]
Mobile-Agent-E is a hierarchical multi-agent framework capable of self-evolution through past experience.
Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-20T20:35:46Z)
- Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation [10.817783356090027]
Large language models (LLMs) are increasingly integrated into every aspect of our work and daily lives.
Growing concerns about user privacy are pushing the trend toward local deployment of these models.
Given this rapidly emerging application, we are concerned about their performance on commercial off-the-shelf mobile devices.
arXiv Detail & Related papers (2024-10-04T17:14:59Z)
- FLAME: Learning to Navigate with Multimodal LLM in Urban Environments [12.428873051106702]
Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks.
However, LLMs struggle with specialized navigation tasks, yielding suboptimal performance compared to dedicated VLN models.
We introduce FLAME, a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks.
arXiv Detail & Related papers (2024-08-20T17:57:46Z)
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large model (MLLM) that unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
- 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation [13.622700558266658]
We propose using frozen pre-trained vision-language models (VLMs) as backbones, with a specific emphasis on enhancing cross-modal feature interaction.
First, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the issue of domain gap.
Second, we add more cross-modal feature fusion to the pipeline to enhance the utilization of multi-modal information.
arXiv Detail & Related papers (2024-06-07T11:15:03Z)
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration [52.25473993987409]
We propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance.
The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent.
We show that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture.
arXiv Detail & Related papers (2024-06-03T05:50:00Z)
- Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models [76.99140362751787]
We present NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks.
We also present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View features.
arXiv Detail & Related papers (2024-01-02T01:54:22Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- "What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces [19.656406003275713]
We study how large language models (LLMs) can be used to retrieve and locate important elements for a given user query in a web interface.
Our empirical experiments show that while LLMs exhibit a reasonable level of performance in retrieving important UI elements, there is still substantial room for improvement.
arXiv Detail & Related papers (2023-12-11T06:26:38Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
- ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces [12.52699475631247]
We introduce a new pre-trained UI representation model called ActionBert.
Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components.
Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
arXiv Detail & Related papers (2020-12-22T20:49:52Z)