Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination?
- URL: http://arxiv.org/abs/2008.05132v2
- Date: Mon, 7 Sep 2020 12:57:33 GMT
- Title: Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination?
- Authors: Jieshan Chen, Mulong Xie, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu and Guoqiang Li
- Abstract summary: We conduct the first large-scale empirical study of seven representative GUI element detection methods on over 50k GUI images.
This study sheds light on the technical challenges to be addressed and informs the design of new GUI element detection methods.
Our evaluation on 25,000 GUI images shows that our method significantly advances the state-of-the-art performance in GUI element detection.
- Score: 21.91118062303175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting Graphical User Interface (GUI) elements in GUI images is a
domain-specific object detection task. It supports many software engineering
tasks, such as GUI animation and testing, GUI search and code generation.
Existing studies for GUI element detection directly borrow mature methods from
the computer vision (CV) domain, including old-fashioned ones that rely on
traditional image-processing features (e.g., Canny edges, contours), and deep
learning models that learn to detect from large-scale GUI data. Unfortunately,
these CV methods were not originally designed with awareness of the unique
characteristics of GUIs and GUI elements, or of the high localization accuracy required by
the GUI element detection task. We conduct the first large-scale empirical
study of seven representative GUI element detection methods on over 50k GUI
images to understand the capabilities, limitations and effective designs of
these methods. This study not only sheds light on the technical challenges to
be addressed but also informs the design of new GUI element detection methods.
We accordingly design a new GUI-specific old-fashioned method for non-text GUI
element detection, which adopts a novel top-down coarse-to-fine strategy, and
combine it with a mature deep learning model for GUI text detection. Our
evaluation on 25,000 GUI images shows that our method significantly advances
the state-of-the-art performance in GUI element detection.
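For readers unfamiliar with the old-fashioned family of methods, the following is a minimal sketch of a bottom-up Canny-edge/contour detector that proposes non-text GUI element candidates from a screenshot. It illustrates the kind of baseline the study compares against rather than the paper's top-down coarse-to-fine method; the input path, edge thresholds, and minimum-area filter are placeholder assumptions.

```python
# Illustrative only: a naive bottom-up Canny-edge + contour detector for
# non-text GUI element candidates, in the spirit of the old-fashioned
# baselines discussed in the abstract. It is NOT the paper's top-down
# coarse-to-fine method; "screenshot.png" and all thresholds are
# placeholder assumptions.
import cv2

def detect_gui_element_candidates(image_path, min_area=400):
    img = cv2.imread(image_path)                      # BGR screenshot
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)       # suppress texture noise
    edges = cv2.Canny(blurred, 50, 150)               # edge map
    # Close small gaps so element borders form connected contours.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    closed = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w * h >= min_area:                         # drop tiny regions
            boxes.append((x, y, w, h))
    return boxes

if __name__ == "__main__":
    for box in detect_gui_element_candidates("screenshot.png"):
        print(box)
```

Bottom-up aggregation of this kind is sensitive to the edge thresholds and area filter and easily over- or under-segments GUI regions, which is the sort of limitation that motivates the paper's GUI-specific top-down coarse-to-fine design.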
Related papers
- GUI Element Detection Using SOTA YOLO Deep Learning Models [5.835026544704744]
Detection of Graphical User Interface (GUI) elements is a crucial task for automatic code generation from images and sketches, GUI testing, and GUI search.
Recent studies have leveraged both old-fashioned and modern computer vision (CV) techniques.
In this study, we evaluate the performance of the four most recent successful YOLO models for general object detection tasks on GUI element detection.
arXiv Detail & Related papers (2024-08-07T02:18:39Z)
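As a rough illustration of what such an evaluation involves at inference time, the sketch below runs an off-the-shelf YOLO detector on a single GUI screenshot via the third-party ultralytics package; the checkpoint name and image path are placeholder assumptions, and a study like the one above would additionally fine-tune and evaluate the models on annotated GUI datasets.

```python
# Minimal sketch: off-the-shelf YOLO inference on a GUI screenshot.
# Assumes the third-party "ultralytics" package; the checkpoint and the
# image path are placeholders, not artifacts from the study above.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # pretrained general-purpose detector
results = model("screenshot.png")   # run detection on one screenshot

for result in results:
    for box in result.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        print(f"class={int(box.cls)} conf={float(box.conf):.2f} "
              f"bbox=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```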
- GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents.
First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs.
Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z)
- GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
- VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks.
Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software.
Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z)
- Interlinking User Stories and GUI Prototyping: A Semi-Automatic LLM-based Approach [55.762798168494726]
We present a novel Large Language Model (LLM)-based approach for validating the implementation of functional NL-based requirements in a graphical user interface (GUI) prototype.
Our approach aims to detect functional user stories that are not implemented in a GUI prototype and provides recommendations for suitable GUI components directly implementing the requirements.
arXiv Detail & Related papers (2024-06-12T11:59:26Z)
- GUing: A Mobile GUI Search Engine using a Vision-Language Model [6.024602799136753]
This paper proposes GUing, a GUI search engine based on a vision-language model called GUIClip.
We first collected app introduction images from Google Play, which display the most representative screenshots.
Then, we developed an automated pipeline to classify, crop, and extract the captions from these images.
We used this dataset to train a novel vision-language model, which is, to the best of our knowledge, the first of its kind for GUI retrieval.
arXiv Detail & Related papers (2024-04-30T18:42:18Z)
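To make the retrieval setting concrete (a text query scored against a gallery of GUI screenshots), the sketch below ranks screenshots for a natural-language query using a generic public CLIP checkpoint from Hugging Face Transformers. It is not the authors' GUIClip model; the image paths and the query string are placeholder assumptions.

```python
# Minimal sketch of CLIP-style text-to-screenshot retrieval.
# Uses a generic public CLIP checkpoint, NOT the GUIClip model from the
# paper above; image paths and the query are placeholder assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["login.png", "settings.png", "checkout.png"]   # screenshot gallery
images = [Image.open(p) for p in paths]
query = "a login screen with username and password fields"

inputs = processor(text=[query], images=images,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_text has shape (num_queries, num_images); higher means more similar.
scores = outputs.logits_per_text[0].softmax(dim=-1)
for idx in scores.argsort(descending=True).tolist():
    print(f"{paths[idx]}: {scores[idx].item():.3f}")
```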
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents [17.43878828389188]
We propose a novel visual Graphical User Interface (GUI) agent, SeeClick, which only relies on screenshots for task automation.
To tackle the key challenge of GUI grounding, we propose to enhance SeeClick with GUI grounding pre-training and devise a method to automate the curation of GUI grounding data.
We have also created ScreenSpot, the first realistic GUI grounding benchmark that encompasses mobile, desktop, and web environments.
arXiv Detail & Related papers (2024-01-17T08:10:35Z)
- CogAgent: A Visual Language Model for GUI Agents [61.26491779502794]
We introduce CogAgent, a visual language model (VLM) specializing in GUI understanding and navigation.
By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120*1120.
CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE.
arXiv Detail & Related papers (2023-12-14T13:20:57Z)
- Vision-Based Mobile App GUI Testing: A Survey [29.042723121518765]
Vision-based mobile app GUI testing approaches emerged with the development of computer vision technologies.
We provide a comprehensive investigation of the state-of-the-art techniques on 271 papers, among which 92 are vision-based studies.
arXiv Detail & Related papers (2023-10-20T14:04:04Z)
- From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces [66.85108822706489]
This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use.
It is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
arXiv Detail & Related papers (2023-05-31T23:39:18Z)
- GUILGET: GUI Layout GEneration with Transformer [26.457270239234383]
The goal is to support the initial step of GUI design by producing realistic and diverse GUI layouts.
GUILGET is based on transformers in order to capture the semantics of the relationships between elements from GUI-AG.
Our experiments, which are conducted on the CLAY dataset, reveal that our model has the best understanding of relationships from GUI-AG.
arXiv Detail & Related papers (2023-04-18T14:27:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.