ActionBert: Leveraging User Actions for Semantic Understanding of User
Interfaces
- URL: http://arxiv.org/abs/2012.12350v2
- Date: Mon, 25 Jan 2021 20:37:39 GMT
- Title: ActionBert: Leveraging User Actions for Semantic Understanding of User
Interfaces
- Authors: Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan
Wichers, Gabriel Schubiner, Ruby Lee, Jindong Chen and Blaise Agüera y Arcas
- Abstract summary: We introduce a new pre-trained UI representation model called ActionBert.
Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components.
Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
- Score: 12.52699475631247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As mobile devices are becoming ubiquitous, regularly interacting with a
variety of user interfaces (UIs) is a common aspect of daily life for many
people. To improve the accessibility of these devices and to enable their usage
in a variety of settings, building models that can assist users and accomplish
tasks through the UI is vitally important. However, there are several
challenges to achieve this. First, UI components of similar appearance can have
different functionalities, making understanding their function more important
than just analyzing their appearance. Second, domain-specific features like
Document Object Model (DOM) in web pages and View Hierarchy (VH) in mobile
applications provide important signals about the semantics of UI elements, but
these features are not in a natural language format. Third, owing to a large
diversity in UIs and absence of standard DOM or VH representations, building a
UI understanding model with high coverage requires large amounts of training
data.
Inspired by the success of pre-training based approaches in NLP for tackling
a variety of problems in a data-efficient way, we introduce a new pre-trained
UI representation model called ActionBert. Our methodology is designed to
leverage visual, linguistic and domain-specific features in user interaction
traces to pre-train generic feature representations of UIs and their
components. Our key intuition is that user actions, e.g., a sequence of clicks
on different UI components, reveal important information about their
functionality. We evaluate the proposed model on a wide variety of downstream
tasks, ranging from icon classification to UI component retrieval based on its
natural language description. Experiments show that the proposed ActionBert
model outperforms multi-modal baselines across all downstream tasks by up to
15.5%.
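To make the pre-training setup above more concrete, the following minimal PyTorch sketch shows one way an interaction trace could be encoded: each clicked UI component is represented by its text, visual, and bounding-box features (the latter drawn from the DOM or View Hierarchy), and the sequence of components is passed through a Transformer encoder. This is an illustrative sketch, not the authors' implementation; the `UITraceEncoder` class, all feature dimensions, and the choice of inputs are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical feature sizes -- not taken from the paper.
TEXT_DIM, IMG_DIM, BOX_DIM, D_MODEL = 64, 128, 4, 256

class UITraceEncoder(nn.Module):
    """Encode a user interaction trace: each step is one clicked UI element,
    represented by a text embedding, a visual (image) embedding, and a
    normalized bounding box taken from the DOM / View Hierarchy."""

    def __init__(self, num_layers: int = 2):
        super().__init__()
        # Project the concatenated per-element features to the model width.
        self.input_proj = nn.Linear(TEXT_DIM + IMG_DIM + BOX_DIM, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_emb, img_emb, boxes):
        # text_emb: (B, T, TEXT_DIM), img_emb: (B, T, IMG_DIM), boxes: (B, T, BOX_DIM)
        x = self.input_proj(torch.cat([text_emb, img_emb, boxes], dim=-1))
        return self.encoder(x)  # (B, T, D_MODEL) contextual element representations

# Toy trace: a batch of one, five clicked elements in sequence.
B, T = 1, 5
model = UITraceEncoder()
reps = model(torch.randn(B, T, TEXT_DIM), torch.randn(B, T, IMG_DIM), torch.rand(B, T, BOX_DIM))
print(reps.shape)  # torch.Size([1, 5, 256])
```

In a full pipeline, contextual representations of this kind would be trained with pre-training objectives defined over the interaction trace and then fine-tuned for downstream tasks such as icon classification or retrieving a UI component from its natural language description.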
Related papers
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs).
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z)
- AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents [50.39555842254652]
We introduce the Android Multi-annotation EXpo (AMEX) to advance research on AI agents in mobile scenarios.
AMEX comprises over 104K high-resolution screenshots from 110 popular mobile applications, which are annotated at multiple levels.
AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions.
arXiv Detail & Related papers (2024-07-03T17:59:58Z)
- Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
- UI Semantic Group Detection: Grouping UI Elements with Similar Semantics in Mobile Graphical User Interface [10.80156450091773]
Existing studies on UI elements grouping mainly focus on a single UI-related software engineering task, and their groups vary in appearance and function.
We propose our semantic component groups that pack adjacent text and non-text elements with similar semantics.
To recognize semantic component groups on a UI page, we propose a robust, deep learning-based vision detector, UISCGD.
arXiv Detail & Related papers (2024-03-08T01:52:44Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations [13.939350184164017]
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language.
We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM).
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z)
- Towards Better Semantic Understanding of Mobile Interfaces [7.756895821262432]
We release a human-annotated dataset with approximately 500k unique annotations aimed at increasing the understanding of the functionality of UI elements.
This dataset augments images and view hierarchies from RICO, a large dataset of mobile UIs.
We also release models using image-only and multimodal inputs; we experiment with various architectures and study the benefits of using multimodal inputs on the new dataset.
arXiv Detail & Related papers (2022-10-06T03:48:54Z)
- Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus [9.401663915424008]
We propose a vision-language model that only takes the screenshot of the UI and a region of interest on the screen as the input.
Our experiments show that our model obtains SoTA results on several representative UI tasks and outperforms previous methods.
arXiv Detail & Related papers (2022-09-29T16:45:43Z)
- Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
arXiv Detail & Related papers (2022-04-10T11:11:10Z)
- UIBert: Learning Generic Multimodal Representations for UI Understanding [12.931540149350633]
We introduce a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data.
Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components, are predictive of each other.
We propose five pretraining tasks utilizing this self-alignment among different features of a UI component and across various components in the same UI.
We evaluate our method on nine real-world downstream UI tasks where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.
arXiv Detail & Related papers (2021-07-29T03:51:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.