Explorer: Robust Collection of Interactable GUI Elements
- URL: http://arxiv.org/abs/2504.09352v1
- Date: Sat, 12 Apr 2025 22:02:29 GMT
- Title: Explorer: Robust Collection of Interactable GUI Elements
- Authors: Iason Chaimalas, Arnas Vyšniauskas, Gabriel Brostow
- Abstract summary: We take the perspective that a given user needs confidence that the relevant UI elements are being detected correctly throughout one app or digital environment. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, where the training process has access to a live version of the application. Explorer also enables the recording of interactive user sessions, and the subsequent mapping of how such sessions overlap and sometimes loop back to similar states.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automation of existing Graphical User Interfaces (GUIs) is important but hard to achieve. Upstream of making the GUI user-accessible or somehow scriptable, even the data collection to understand the original interface poses significant challenges. For example, large quantities of general UI data seem helpful for training general machine learning (ML) models, but accessibility for each person can hinge on the ML's precision on a specific app. We therefore take the perspective that a given user needs confidence that the relevant UI elements are being detected correctly throughout one app or digital environment. We mostly assume that the target application is known in advance, so that data collection and ML-training can be personalized for the test-time target domain. The proposed Explorer system focuses on detecting on-screen buttons and text-entry fields, i.e., interactables, where the training process has access to a live version of the application. The live application can run on almost any popular platform except iOS phones, and the collection is especially streamlined for Android phones or for desktop Chrome browsers. Explorer also enables the recording of interactive user sessions, and subsequent mapping of how such sessions overlap and sometimes loop back to similar states. We show how having such a map enables a kind of path planning through the GUI, letting a user issue audio commands to get to their destination. Critically, we are releasing our code for Explorer openly at https://github.com/varnelis/Explorer.
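The session map described in the abstract is essentially a directed graph over observed screen states, with recorded interactions as edges, so reaching a destination screen reduces to graph search. The sketch below only illustrates that idea; the names (Interaction, GuiMap) and fields are hypothetical and are not taken from the released Explorer code.

```python
# Illustrative sketch only: a GUI session map as a directed graph, plus
# breadth-first path planning to a destination screen. All names here are
# hypothetical, not drawn from the Explorer repository.
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interaction:
    element_id: str   # id of a detected interactable (button, text field, ...)
    action: str       # e.g. "tap" or "type"


@dataclass
class GuiMap:
    # screen id -> list of (interaction, resulting screen id)
    edges: dict[str, list[tuple[Interaction, str]]] = field(default_factory=dict)

    def record(self, src: str, interaction: Interaction, dst: str) -> None:
        """Add one observed transition from a recorded user session."""
        self.edges.setdefault(src, []).append((interaction, dst))

    def plan(self, start: str, goal: str) -> list[Interaction] | None:
        """Shortest interaction sequence from start to goal, or None if unreachable."""
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            screen, path = queue.popleft()
            if screen == goal:
                return path
            for interaction, nxt in self.edges.get(screen, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [interaction]))
        return None


if __name__ == "__main__":
    m = GuiMap()
    m.record("home", Interaction("settings_btn", "tap"), "settings")
    m.record("settings", Interaction("wifi_row", "tap"), "wifi")
    print(m.plan("home", "wifi"))  # two taps: settings_btn, then wifi_row
```

In such a setup, an audio command would only need to be mapped to a goal screen; the recorded graph supplies the sequence of taps to get there.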
Related papers
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs [54.58905728115257]
We propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale.
Specifically, we leverage large language models (LLMs) to infer element functionality by comparing the UI content changes before and after simulated interactions with specific UI elements.
We construct an AutoGUI-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations that have never been provided by previous datasets.
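A rough sketch of the before/after comparison this entry describes: an LLM is shown the UI content before and after a simulated interaction with an element and asked to describe the element's functionality. The function name, prompt wording, and the llm_complete callable are assumptions for illustration, not the paper's pipeline.

```python
# Hypothetical illustration of annotating a UI element from the UI change it
# causes; llm_complete stands in for any text-completion call.
from typing import Callable


def annotate_functionality(
    element_desc: str,
    ui_before: str,
    ui_after: str,
    llm_complete: Callable[[str], str],
) -> str:
    """Return a one-sentence functionality description for one UI element."""
    prompt = (
        "A user interacted with the following UI element:\n"
        f"{element_desc}\n\n"
        f"UI content before the interaction:\n{ui_before}\n\n"
        f"UI content after the interaction:\n{ui_after}\n\n"
        "In one sentence, describe the functionality of this element."
    )
    return llm_complete(prompt)
```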
arXiv Detail & Related papers (2025-02-04T03:39:59Z)
- WinClick: GUI Grounding with Multimodal Large Language Models [46.44235543835595]
We introduce WinClick, a novel visual GUI agent developed for the Windows platform.
To overcome the challenge of GUI grounding, we enhance WinClick with GUI grounding pre-training.
We also introduce WinSpot, the first comprehensive benchmark for GUI grounding on Windows.
arXiv Detail & Related papers (2025-01-27T08:29:17Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions.
In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Falcon-UI: Understanding GUI Before Following User Instructions [57.67308498231232]
We introduce an instruction-free GUI navigation dataset, termed the Insight-UI dataset, to enhance model comprehension of GUI environments. The Insight-UI dataset is automatically generated from the Common Crawl corpus, simulating various platforms. We develop the GUI agent model Falcon-UI, which is initially pretrained on the Insight-UI dataset and subsequently fine-tuned on Android and Web GUI datasets.
arXiv Detail & Related papers (2024-12-12T15:29:36Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents. Our approach leverages image-based observations and grounds natural-language instructions to visual elements. To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- Ponder & Press: Advancing Visual GUI Agent towards General Computer Control [13.39115823642937]
Ponder & Press is a divide-and-conquer framework for general computer control using only visual input. Our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications.
arXiv Detail & Related papers (2024-12-02T08:35:31Z)
- GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents.
First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs.
Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z)
- GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding [73.9254861755974]
This paper introduces a new dataset, termed GUI-World, which features meticulously crafted Human-MLLM annotations. We evaluate the capabilities of current state-of-the-art MLLMs, including Image LLMs and Video LLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
- Tell Me What's Next: Textual Foresight for Generic UI Representations [65.10591722192609]
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train with our newly constructed mobile app dataset, OpenApp, which results in the first public dataset for app UI representation learning.
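A loose sketch of the pretraining objective described above: each example pairs the current UI and a local action with a text description of the resulting UI, which can be flattened into an (input, target) pair for a text generator. The field names and formatting are assumptions, not the OpenApp data format.

```python
# Hypothetical shape of a Textual-Foresight-style training pair: the model is
# asked to describe the *next* screen, given the current screen and an action.
from dataclasses import dataclass


@dataclass
class ForesightExample:
    current_screen: str        # e.g. a screenshot path or encoded observation
    action: str                # local action taken on that screen, e.g. "tap 'Cart'"
    future_description: str    # target: global text description of the next UI


def to_seq2seq(ex: ForesightExample) -> tuple[str, str]:
    """Flatten one example into an (input, target) text pair."""
    source = f"screen: {ex.current_screen} | action: {ex.action} | describe the next screen:"
    return source, ex.future_description
```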
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents [17.43878828389188]
We propose a novel visual Graphical User Interface (GUI) agent, SeeClick, which only relies on screenshots for task automation.
To tackle the challenge of GUI grounding, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data.
We have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments.
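For context, GUI grounding benchmarks of this kind are commonly scored by click accuracy: a prediction counts as correct if the predicted click point falls inside the target element's bounding box. The check below is a generic sketch of that criterion, not the ScreenSpot evaluation code.

```python
# Generic sketch of a click-accuracy check for GUI grounding; a predicted
# point is a hit if it lies inside the target element's bounding box.
from __future__ import annotations


def point_in_box(x: float, y: float, box: tuple[float, float, float, float]) -> bool:
    """box is (left, top, right, bottom) in the same coordinates as (x, y)."""
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(preds: list[tuple[float, float]],
                   boxes: list[tuple[float, float, float, float]]) -> float:
    """Fraction of predicted click points that land inside their target boxes."""
    hits = sum(point_in_box(x, y, box) for (x, y), box in zip(preds, boxes))
    return hits / max(len(boxes), 1)
```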
arXiv Detail & Related papers (2024-01-17T08:10:35Z)
- META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI [28.484013258445067]
We propose a new task-oriented dialogue (TOD) architecture: the GUI-based task-oriented dialogue system (GUI-TOD).
A GUI-TOD system can directly perform GUI operations on real apps and execute tasks without invoking backend APIs.
We release META-GUI, a dataset for training a multi-modal conversational agent on mobile GUIs.
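To make "directly perform GUI operations on real apps" concrete, the sketch below drives an Android device with adb's standard input commands (tap and text). It is one possible executor, offered only as an illustration under those assumptions, not the META-GUI implementation.

```python
# Hypothetical executor for GUI actions on an Android device via adb's
# standard "input" commands; requires adb on PATH and a connected device.
import subprocess


def tap(x: int, y: int) -> None:
    """Simulate a tap at pixel coordinates (x, y)."""
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)


def type_text(text: str) -> None:
    """Type text into the focused field; adb's input text expects spaces as %s."""
    subprocess.run(["adb", "shell", "input", "text", text.replace(" ", "%s")],
                   check=True)


# Example: tap a search box at known coordinates, then type a query.
# tap(540, 180)
# type_text("coffee near me")
```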
arXiv Detail & Related papers (2022-05-23T04:05:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.