Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces
- URL: http://arxiv.org/abs/2404.13521v1
- Date: Sun, 21 Apr 2024 04:06:09 GMT
- Title: Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces
- Authors: Yue Jiang, Changkong Zhou, Vikas Garg, Antti Oulasvirta,
- Abstract summary: Graph4GUI exploits graph neural networks to capture individual elements' properties and semantic-visuo-spatial constraints in a layout.
The learned representation demonstrated its effectiveness in multiple tasks, especially generating designs in a challenging GUI autocompletion task.
- Score: 27.84098739594353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Present-day graphical user interfaces (GUIs) exhibit diverse arrangements of text, graphics, and interactive elements such as buttons and menus, but representations of GUIs have not kept up. They do not encapsulate both semantic and visuo-spatial relationships among elements. To seize machine learning's potential for GUIs more efficiently, Graph4GUI exploits graph neural networks to capture individual elements' properties and their semantic-visuo-spatial constraints in a layout. The learned representation demonstrated its effectiveness in multiple tasks, especially generating designs in a challenging GUI autocompletion task, which involved predicting the positions of remaining unplaced elements in a partially completed GUI. The new model's suggestions showed alignment and visual appeal superior to the baseline method and received higher subjective ratings for preference. Furthermore, we demonstrate the practical benefits and efficiency advantages designers perceive when utilizing our model as an autocompletion plug-in.
Related papers
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents [20.08996257335876]
We advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly take pixel-level operations on the GUI.
We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots.
We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models.
arXiv Detail & Related papers (2024-10-07T17:47:50Z) - GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents.
First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs.
Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z) - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z) - VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks.
Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software.
Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z) - From Pixels to UI Actions: Learning to Follow Instructions via Graphical
User Interfaces [66.85108822706489]
This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use.
It is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
arXiv Detail & Related papers (2023-05-31T23:39:18Z) - GUILGET: GUI Layout GEneration with Transformer [26.457270239234383]
The goal is to support the initial step of GUI design by producing realistic and diverse GUI layouts.
GUILGET is based on transformers in order to capture the semantic in relationships between elements from GUI-AG.
Our experiments, which are conducted on the CLAY dataset, reveal that our model has the best understanding of relationships from GUI-AG.
arXiv Detail & Related papers (2023-04-18T14:27:34Z) - Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of
GUI Widgets from GUI Images [21.498096538797952]
We present a novel unsupervised image-based method for inferring perceptual groups of GUI widgets.
The evaluation on a dataset of 1,091 GUIs collected from 772 mobile apps and 20 UI design mockups shows that our method significantly outperforms the state-of-the-art ad-hocs-based baseline.
arXiv Detail & Related papers (2022-06-15T05:16:03Z) - Object Detection for Graphical User Interface: Old Fashioned or Deep
Learning or a Combination? [21.91118062303175]
We conduct the first large-scale empirical study of seven representative GUI element detection methods on over 50k GUI images.
This study sheds the light on the technical challenges to be addressed and informs the design of new GUI element detection methods.
Our evaluation on 25,000 GUI images shows that our method significantly advances the start-of-the-art performance in GUI element detection.
arXiv Detail & Related papers (2020-08-12T06:36:33Z) - GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training [62.73470368851127]
Graph representation learning has emerged as a powerful technique for addressing real-world problems.
We design Graph Contrastive Coding -- a self-supervised graph neural network pre-training framework.
We conduct experiments on three graph learning tasks and ten graph datasets.
arXiv Detail & Related papers (2020-06-17T16:18:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.