Spotlight: Mobile UI Understanding using Vision-Language Models with a
Focus
- URL: http://arxiv.org/abs/2209.14927v1
- Date: Thu, 29 Sep 2022 16:45:43 GMT
- Title: Spotlight: Mobile UI Understanding using Vision-Language Models with a
Focus
- Authors: Gang Li, Yang Li
- Abstract summary: We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input.
Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
- Score: 9.401663915424008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mobile UI understanding is important for enabling various interaction tasks
such as UI automation and accessibility. Previous mobile UI modeling often
depends on the view hierarchy information of a screen, which directly provides
the structural data of the UI, with the hope to bypass challenging tasks of
visual modeling from screen pixels. However, view hierarchy is not always
available, and is often corrupted with missing object descriptions or
misaligned bounding box positions. As a result, although using view hierarchy
offers some short-term gains, it may ultimately hinder the applicability and
performance of the model. In this paper, we propose Spotlight, a vision-only
approach for mobile UI understanding. Specifically, we enhance a
vision-language model that only takes the screenshot of the UI and a region of
interest on the screen -- the focus -- as the input. This general architecture
is easily scalable and capable of performing a range of UI modeling tasks. Our
experiments show that our model obtains SoTA results on several representative
UI tasks and outperforms previous methods that use both screenshots and view
hierarchies as input. Furthermore, we explore the multi-task learning and
few-shot prompting capacity of the proposed models, demonstrating promising
results in the multi-task learning direction.
Related papers
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z) - Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs [44.636020540018194]
We present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens.
Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions.
Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
arXiv Detail & Related papers (2024-04-08T17:55:44Z) - MiniGPT-v2: large language model as a unified interface for
vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z) - ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine
Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language.
We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM)
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z) - Towards Better Semantic Understanding of Mobile Interfaces [7.756895821262432]
We release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements.
This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs.
We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset.
arXiv Detail & Related papers (2022-10-06T03:48:54Z) - Zero Experience Required: Plug & Play Modular Transfer Learning for
Semantic Visual Navigation [97.17517060585875]
We present a unified approach to visual navigation using a novel modular transfer learning model.
Our model can effectively leverage its experience from one source task and apply it to multiple target tasks.
Our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
arXiv Detail & Related papers (2022-02-05T00:07:21Z) - Multimodal Icon Annotation For Mobile Applications [11.342641993269693]
We propose a novel deep learning based multi-modal approach that combines the benefits of both pixel and view hierarchy features.
In order to demonstrate the utility provided, we create a high quality UI dataset by manually annotating the most commonly used 29 icons in Rico.
arXiv Detail & Related papers (2021-07-09T13:57:37Z) - VINS: Visual Search for Mobile User Interface Design [66.28088601689069]
This paper introduces VINS, a visual search framework, that takes as input a UI image and retrieves visually similar design examples.
The framework achieves a mean Average Precision of 76.39% for the UI detection and high performance in querying similar UI designs.
arXiv Detail & Related papers (2021-02-10T01:46:33Z) - Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture.
Our models learn to generate labels in text based on the visual and textual inputs.
Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z) - ActionBert: Leveraging User Actions for Semantic Understanding of User
Interfaces [12.52699475631247]
We introduce a new pre-trained UI representation model called ActionBert.
Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components.
Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
arXiv Detail & Related papers (2020-12-22T20:49:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.