Related papers: Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces

GUI-ReRank: Enhancing GUI Retrieval with Multi-Modal LLM-based Reranking [55.762798168494726]
GUI-ReRank is a novel framework that integrates rapid embedding-based constrained retrieval models with highly effective MLLM-based reranking techniques. We evaluated our approach on an established NL-based GUI retrieval benchmark.
arXiv Detail & Related papers (2025-08-05T10:17:38Z)
DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning [52.37530640460363]
We introduce DiMo-GUI, a training-free framework for GUI grounding. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions.
arXiv Detail & Related papers (2025-06-12T03:13:21Z)
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis [15.429065788185522]
We introduce a large-scale data synthesis pipeline, UI-E2I-Synth, for generating instruction datasets of varying complexity. We propose a new GUI instruction grounding benchmark, UI-I2E-Bench, which is designed to address the limitations of existing benchmarks. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding.
arXiv Detail & Related papers (2025-04-15T14:56:21Z)
MP-GUI: Modality Perception with MLLMs for GUI Understanding [12.812289005013797]
MP-GUI is a specially designed MLLM for GUI understanding. MP-GUI features three precisely specialized perceivers to extract graphical, textual, and spatial modalities from the screen. To cope with the scarcity of training data, we also introduce a pipeline for automatic data collection.
arXiv Detail & Related papers (2025-03-18T08:32:22Z)
Zero-Shot Prompting Approaches for LLM-based Graphical User Interface Generation [53.1000575179389]
We propose a Retrieval-Augmented GUI Generation (RAGG) approach, integrated with an LLM-based GUI retrieval re-ranking and filtering mechanism. In addition, we adapt Prompt Decomposition (PDGG) and Self-Critique (SCGG) for GUI generation. Our evaluation, which encompasses over 3,000 GUI annotations from over 100 crowd-workers with UI/UX experience, shows that SCGG, in contrast to PDGG and RAGG, can lead to more effective GUI generation.
arXiv Detail & Related papers (2024-12-15T22:17:30Z)
Falcon-UI: Understanding GUI Before Following User Instructions [57.67308498231232]
We introduce an instruction-free GUI navigation dataset, termed Insight-UI dataset, to enhance model comprehension of GUI environments. Insight-UI dataset is automatically generated from the Common Crawl corpus, simulating various platforms. We develop the GUI agent model Falcon-UI, which is initially pretrained on Insight-UI dataset and subsequently fine-tuned on Android and Web GUI datasets.
arXiv Detail & Related papers (2024-12-12T15:29:36Z)
Fragmented Layer Grouping in GUI Designs Through Graph Learning Based on Multimodal Information [12.302861965706885]
In the industrial GUI-to-code process, fragmented layers may decrease the readability and maintainability of generated code. This study proposes a graph-learning-based approach to tackle the fragmented layer grouping problem according to multi-modal information in design prototypes.
arXiv Detail & Related papers (2024-12-07T06:31:09Z)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents. Our approach leverages image-based observations and grounds natural-language instructions in visual elements. To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. We develop a vision-language-action model for the digital world, namely ShowUI. ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents [20.08996257335876]
We advocate a human-like embodiment for GUI agents that perceive the environment entirely visually and directly take pixel-level operations on the GUI. We collect the largest dataset for GUI visual grounding so far, containing 10M GUI elements and their referring expressions over 1.3M screenshots. We show that a simple recipe, which includes web-based synthetic data and slight adaptation of the LLaVA architecture, is surprisingly effective for training such visual grounding models.
arXiv Detail & Related papers (2024-10-07T17:47:50Z)
GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z)
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software. Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z)
From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces [66.85108822706489]
This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use. It is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
arXiv Detail & Related papers (2023-05-31T23:39:18Z)
GUILGET: GUI Layout GEneration with Transformer [26.457270239234383]
The goal is to support the initial step of GUI design by producing realistic and diverse GUI layouts. GUILGET is based on transformers in order to capture the semantics of the relationships between elements from GUI-AG. Our experiments, which are conducted on the CLAY dataset, reveal that our model has the best understanding of relationships from GUI-AG.
arXiv Detail & Related papers (2023-04-18T14:27:34Z)
Psychologically-Inspired, Unsupervised Inference of Perceptual Groups of GUI Widgets from GUI Images [21.498096538797952]
We present a novel unsupervised image-based method for inferring perceptual groups of GUI widgets. The evaluation on a dataset of 1,091 GUIs collected from 772 mobile apps and 20 UI design mockups shows that our method significantly outperforms the state-of-the-art ad-hoc heuristics-based baseline.
arXiv Detail & Related papers (2022-06-15T05:16:03Z)
Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination? [21.91118062303175]
We conduct the first large-scale empirical study of seven representative GUI element detection methods on over 50k GUI images. This study sheds light on the technical challenges to be addressed and informs the design of new GUI element detection methods. Our evaluation on 25,000 GUI images shows that our method significantly advances the state-of-the-art performance in GUI element detection.
arXiv Detail & Related papers (2020-08-12T06:36:33Z)
GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training [62.73470368851127]
Graph representation learning has emerged as a powerful technique for addressing real-world problems. We design Graph Contrastive Coding -- a self-supervised graph neural network pre-training framework. We conduct experiments on three graph learning tasks and ten graph datasets.
arXiv Detail & Related papers (2020-06-17T16:18:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.