Related papers: Leveraging Large Vision Language Model For Better Automatic Web GUI Testing

URL: http://arxiv.org/abs/2410.12157v1
Date: Wed, 16 Oct 2024 01:37:58 GMT
Title: Leveraging Large Vision Language Model For Better Automatic Web GUI Testing
Authors: Siyi Wang, Sinan Wang, Yujia Fan, Xiaolei Li, Yepang Liu
Abstract summary: This paper proposes VETL, the first LVLM-driven end-to-end web testing technique. With LVLM's scene understanding capabilities, VETL can generate valid and meaningful text inputs focusing on the local context. The selection of associated GUI elements is formulated as a visual question-answering problem, allowing LVLM to capture the logical connection between the input box and the relevant element.
Score: 7.480576630392405
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid development of web technology, more and more software applications have become web-based in the past decades. To ensure software quality and user experience, various techniques have been proposed to automatically test web applications by interacting with their GUIs. To achieve high functional coverage, web GUI testing tools often need to generate high-quality text inputs and interact with the associated GUI elements (e.g., click submit buttons). However, developing a holistic approach that solves both subtasks is challenging because the web GUI context can be complicated and highly dynamic, which makes it hard to process programmatically. The recent development of large vision-language models (LVLM) provides new opportunities to handle these longstanding problems. This paper proposes VETL, the first LVLM-driven end-to-end web testing technique. With LVLM's scene understanding capabilities, VETL can generate valid and meaningful text inputs focusing on the local context, while avoiding the need to extract precise textual attributes. The selection of associated GUI elements is formulated as a visual question-answering problem, allowing LVLM to capture the logical connection between the input box and the relevant element based on visual instructions. Further, the GUI exploration is guided by a multi-armed bandit module employing a curiosity-oriented strategy. Experiments show that VETL effectively explores web state/action spaces and detects bugs. Compared with WebExplor, the state-of-the-art web testing technique, VETL can discover 25% more unique web actions on benchmark websites. Moreover, it can expose functional bugs in top-ranking commercial websites, which the website maintainers have confirmed. Our work makes the first attempt at leveraging LVLM in end-to-end GUI testing, demonstrating promising results in this research direction.
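The abstract mentions a multi-armed bandit module with a curiosity-oriented strategy but gives no details. As a rough illustration of that general idea only, the sketch below uses a UCB1-style bandit whose reward is 1 when an action leads to a previously unseen state. The class name, the reward definition, the exploration constant, and the action and state labels are all assumptions made for this example; none of it is taken from VETL's implementation.

```python
import math
from collections import defaultdict


class CuriosityBandit:
    """Hypothetical UCB1-style bandit; reward is 1.0 for reaching an unseen state."""

    def __init__(self, exploration=1.4):
        self.exploration = exploration    # assumed weight of the exploration bonus
        self.pulls = defaultdict(int)     # how often each action was tried
        self.value = defaultdict(float)   # running mean reward per action
        self.seen_states = set()          # states observed so far
        self.total = 0                    # total number of updates

    def select(self, actions):
        # Try every untested action once before applying the UCB score.
        for action in actions:
            if self.pulls[action] == 0:
                return action
        return max(actions, key=self._score)

    def _score(self, action):
        bonus = self.exploration * math.sqrt(
            math.log(self.total) / self.pulls[action]
        )
        return self.value[action] + bonus

    def update(self, action, new_state):
        # Curiosity reward: novel states pay 1.0, revisited states pay 0.0.
        reward = 0.0 if new_state in self.seen_states else 1.0
        self.seen_states.add(new_state)
        self.pulls[action] += 1
        self.total += 1
        # Incremental update of the mean reward for this action.
        self.value[action] += (reward - self.value[action]) / self.pulls[action]
        return reward


if __name__ == "__main__":
    bandit = CuriosityBandit()
    actions = ["click_login", "type_search"]  # made-up action labels
    first = bandit.select(actions)
    print(first, bandit.update(first, "page_A"))
```

In this sketch, actions that keep revealing new states retain a high mean reward, while the bonus term ensures rarely tried actions are still revisited occasionally.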

Related papers

TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents [0.6827423171182154]
TRISHUL is a training-free framework that enhances generalist LVLMs for holistic GUI comprehension. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. For GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
arXiv Detail & Related papers (2025-02-12T09:12:30Z)
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials [53.376263056033046]
Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our fully automated approach significantly reduces data collection costs, achieving a cost of just $0.55 per high-quality trajectory without human annotators.
arXiv Detail & Related papers (2024-12-12T18:59:27Z)
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents. Our approach leverages image-based observations and grounds natural-language instructions to visual elements. To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
Ponder & Press: Advancing Visual GUI Agent towards General Computer Control [13.39115823642937]
Ponder & Press is a divide-and-conquer framework for general computer control using only visual input. Our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications.
arXiv Detail & Related papers (2024-12-02T08:35:31Z)
Large Language Model-Brained GUI Agents: A Survey [42.82362907348966]
Multimodal models have ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands.
arXiv Detail & Related papers (2024-11-27T12:13:39Z)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. We develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations. ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data [15.801018643716437]
This paper aims to enhance the GUI understanding and interacting capabilities of large vision-language models (LVLMs) through a data-driven approach. We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web. Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work.
arXiv Detail & Related papers (2024-10-25T10:46:17Z)
Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z)
Vision-driven Automated Mobile GUI Testing via Multimodal Large Language Model [27.97964877860671]
This paper proposes a vision-driven automated GUI testing approach to detect non-crash functional bugs with Multimodal Large Language Models. It begins by extracting GUI text information and aligning it with screenshots to form a vision prompt, enabling MLLM to understand GUI context. VisionDroid identifies 29 new bugs on Google Play, of which 19 have been confirmed and fixed.
arXiv Detail & Related papers (2024-07-03T11:58:09Z)
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio. Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
GUI Action Narrator: Where and When Did That Action Take Place? [19.344324166716245]
We develop a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning. We introduce our GUI action dataset Act2Cap as well as a simple yet effective framework, GUI Narrator, for GUI video captioning.
arXiv Detail & Related papers (2024-06-19T17:22:11Z)
GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z)
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software. Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z)
"What's important here?": Opportunities and Challenges of Using LLMs in Retrieving Information from Web Interfaces [19.656406003275713]
We study how large language models (LLMs) can be used to retrieve and locate important elements for a user-given query in a web interface. Our empirical experiments show that while LLMs exhibit a reasonable level of performance in retrieving important UI elements, there is still substantial room for improvement.
arXiv Detail & Related papers (2023-12-11T06:26:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.