Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of
GUI Widgets from GUI Images
- URL: http://arxiv.org/abs/2206.10352v2
- Date: Wed, 24 May 2023 01:18:23 GMT
- Title: Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of
GUI Widgets from GUI Images
- Authors: Mulong Xie, Zhenchang Xing, Sidong Feng, Chunyang Chen, Liming Zhu,
Xiwei Xu
- Abstract summary: We present a novel unsupervised image-based method for inferring perceptual groups of GUI widgets.
The evaluation on a dataset of 1,091 GUIs collected from 772 mobile apps and 20 UI design mockups shows that our method significantly outperforms the state-of-the-art ad-hoc heuristics-based baseline.
- Score: 21.498096538797952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A Graphical User Interface (GUI) is not merely a collection of individual and
unrelated widgets; rather, it partitions discrete widgets into groups by
various visual cues, thus forming higher-order perceptual units such as tab,
menu, card or list. The ability to automatically segment a GUI into perceptual
groups of widgets constitutes a fundamental component of visual intelligence to
automate GUI design, implementation and automation tasks. Although humans can
partition a GUI into meaningful perceptual groups of widgets in a highly
reliable way, perceptual grouping is still an open challenge for computational
approaches. Existing methods rely on ad-hoc heuristics or supervised machine
learning that is dependent on specific GUI implementations and runtime
information. Research in psychology and biological vision has formulated a set
of principles (i.e., Gestalt theory of perception) that describe how humans
group elements in visual scenes based on visual cues like connectivity,
similarity, proximity and continuity. These principles are domain-independent
and have been widely adopted by practitioners to structure content on GUIs to
improve aesthetics and usability. Inspired by these principles, we
present a novel unsupervised image-based method for inferring perceptual groups
of GUI widgets. Our method requires only GUI pixel images, is independent of
GUI implementation, and does not require any training data. The evaluation on a
dataset of 1,091 GUIs collected from 772 mobile apps and 20 UI design mockups
shows that our method significantly outperforms the state-of-the-art ad-hoc
heuristics-based baseline. Our perceptual grouping method creates the
opportunities for improving UI-related software engineering tasks.
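To make the grouping idea concrete, here is a minimal Python sketch that is not the authors' pipeline but an illustrative assumption: it links widget bounding boxes that satisfy both a proximity cue and a size-similarity cue, then forms perceptual groups with union-find. The box format, the thresholds `gap_tol` and `size_tol`, and all helper names are hypothetical.

```python
# Minimal sketch (illustrative only): Gestalt-style grouping of hypothetical
# widget bounding boxes (x, y, w, h), e.g. as produced by a GUI widget detector.
from itertools import combinations

def similar(a, b, size_tol=0.2):
    """Similarity cue: widgets with near-identical sizes likely belong together."""
    (_, _, wa, ha), (_, _, wb, hb) = a, b
    return (abs(wa - wb) <= size_tol * max(wa, wb)
            and abs(ha - hb) <= size_tol * max(ha, hb))

def close(a, b, gap_tol=30):
    """Proximity cue: bounding boxes within gap_tol pixels of each other."""
    (xa, ya, wa, ha), (xb, yb, wb, hb) = a, b
    dx = max(xb - (xa + wa), xa - (xb + wb), 0)
    dy = max(yb - (ya + ha), ya - (yb + hb), 0)
    return dx <= gap_tol and dy <= gap_tol

def group(boxes):
    """Union-find over pairwise links: boxes that are both close and similar
    end up in the same perceptual group."""
    parent = list(range(len(boxes)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(boxes)), 2):
        if close(boxes[i], boxes[j]) and similar(boxes[i], boxes[j]):
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(boxes[i])
    return list(groups.values())

if __name__ == "__main__":
    # Three vertically stacked list items plus one distant button:
    # the first three are grouped, the button stays on its own.
    widgets = [(20, 100, 300, 60), (20, 170, 300, 60), (20, 240, 300, 60),
               (200, 600, 120, 48)]
    for g in group(widgets):
        print(g)
```

The actual method described in the paper also exploits connectivity and continuity cues and operates directly on pixel images; this sketch only shows how proximity and similarity can be turned into a simple grouping rule.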
Related papers
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world, which features several innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents [20.08996257335876]
We advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly perform pixel-level operations on the GUI.
We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots.
We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models.
arXiv Detail & Related papers (2024-10-07T17:47:50Z) - GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents.
First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs.
Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z) - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z) - VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks.
Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software.
Our evaluation reveals that even the SoTA large multimodal model GPT-4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z) - Interlinking User Stories and GUI Prototyping: A Semi-Automatic LLM-based Approach [55.762798168494726]
We present a novel Large Language Model (LLM)-based approach for validating the implementation of functional NL-based requirements in a graphical user interface (GUI) prototype.
Our approach aims to detect functional user stories that are not implemented in a GUI prototype and provides recommendations for suitable GUI components directly implementing the requirements.
arXiv Detail & Related papers (2024-06-12T11:59:26Z) - Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces [27.84098739594353]
Graph4GUI exploits graph neural networks to capture individual elements' properties and semantic-visuo-spatial constraints in a layout.
The learned representation demonstrated its effectiveness in multiple tasks, especially generating designs in a challenging GUI autocompletion task.
arXiv Detail & Related papers (2024-04-21T04:06:09Z) - SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents [17.43878828389188]
We propose a novel visual Graphical User Interface (GUI) agent, SeeClick, which only relies on screenshots for task automation.
To tackle this challenge, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data.
We have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments.
arXiv Detail & Related papers (2024-01-17T08:10:35Z) - From Pixels to UI Actions: Learning to Follow Instructions via Graphical
User Interfaces [66.85108822706489]
This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use.
It is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
arXiv Detail & Related papers (2023-05-31T23:39:18Z) - GUILGET: GUI Layout GEneration with Transformer [26.457270239234383]
The goal is to support the initial step of GUI design by producing realistic and diverse GUI layouts.
GUILGET is based on transformers in order to capture the semantics of relationships between elements in the GUI-AG.
Our experiments, which are conducted on the CLAY dataset, reveal that our model has the best understanding of relationships from GUI-AG.
arXiv Detail & Related papers (2023-04-18T14:27:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.