Leveraging Large Vision Language Model For Better Automatic Web GUI Testing
- URL: http://arxiv.org/abs/2410.12157v1
- Date: Wed, 16 Oct 2024 01:37:58 GMT
- Title: Leveraging Large Vision Language Model For Better Automatic Web GUI Testing
- Authors: Siyi Wang, Sinan Wang, Yujia Fan, Xiaolei Li, Yepang Liu,
- Abstract summary: This paper proposes VETL, the first LVLM-driven endtoend web testing technique.
With LVLM's scene understanding capabilities, VETL can generate valid and meaningful text inputs focusing on the local context.
The selection of associated GUI elements is formulated as a visual question-answering problem, allowing LVLM to capture the logical connection between the input box and the relevant element.
- Score: 7.480576630392405
- License:
- Abstract: With the rapid development of web technology, more and more software applications have become web-based in the past decades. To ensure software quality and user experience, various techniques have been proposed to automatically test web applications by interacting with their GUIs. To achieve high functional coverage, web GUI testing tools often need to generate high-quality text inputs and interact with the associated GUI elements (e.g., click submit buttons). However, developing a holistic approach that solves both subtasks is challenging because the web GUI context can be complicated and highly dynamic, which makes it hard to process programmatically. The recent development of large vision-language models (LVLM) provides new opportunities to handle these longstanding problems. This paper proposes VETL, the first LVLM-driven end-to-end web testing technique. With LVLM's scene understanding capabilities, VETL can generate valid and meaningful text inputs focusing on the local context, while avoiding the need to extract precise textual attributes. The selection of associated GUI elements is formulated as a visual question-answering problem, allowing LVLM to capture the logical connection between the input box and the relevant element based on visual instructions. Further, the GUI exploration is guided by a multi-armed bandit module employing a curiosity-oriented strategy. Experiments show that VETL effectively explores web state/action spaces and detects bugs. Compared with WebExplor, the state-of-the-art web testing technique, VETL can discover 25% more unique web actions on benchmark websites. Moreover, it can expose functional bugs in top-ranking commercial websites, which the website maintainers have confirmed. Our work makes the first attempt at leveraging LVLM in end-to-end GUI testing, demonstrating promising results in this research direction.
Related papers
- TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents [0.6827423171182154]
TRISHUL is a training-free framework that enhances generalist LVLMs for holistic GUI comprehension.
Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets.
For GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
arXiv Detail & Related papers (2025-02-12T09:12:30Z) - WebWalker: Benchmarking LLMs in Web Traversal [64.48425443951749]
We introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal.
We propose WebWalker, which is a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
arXiv Detail & Related papers (2025-01-13T18:58:07Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.
Our approach leverages image-based observations, and grounding instructions in natural language to visual elements.
To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - Ponder & Press: Advancing Visual GUI Agent towards General Computer Control [13.39115823642937]
Ponder & Press is a divide-and-conquer framework for general computer control using only visual input.
Our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications.
arXiv Detail & Related papers (2024-12-02T08:35:31Z) - Large Language Model-Brained GUI Agents: A Survey [42.82362907348966]
multimodal models have ushered in a new era of GUI automation.
They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing.
These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands.
arXiv Detail & Related papers (2024-11-27T12:13:39Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs)
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z) - GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations.
We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z) - VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks.
Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software.
Our evaluation reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z) - "What's important here?": Opportunities and Challenges of Using LLMs in
Retrieving Information from Web Interfaces [19.656406003275713]
We study how large language models (LLMs) can be used to retrieve and locate important elements for a user given query in a web interface.
Our empirical experiments show that while LLMs exhibit a reasonable level of performance in retrieving important UI elements, there is still a substantial room for improvement.
arXiv Detail & Related papers (2023-12-11T06:26:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.