Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing
- URL: http://arxiv.org/abs/2305.09434v1
- Date: Tue, 16 May 2023 13:46:52 GMT
- Title: Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing
- Authors: Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, Qing Wang
- Abstract summary: We propose GPTDroid, which asks a Large Language Model (LLM) to chat with a mobile app by passing GUI page information to the LLM to elicit testing scripts.
Within it, we extract the static context of the GUI page and the dynamic context of the iterative testing process.
We evaluate GPTDroid on 86 apps from Google Play: its activity coverage is 71%, 32% higher than the best baseline, and it detects 36% more bugs at a faster rate than the best baseline.
- Score: 23.460051600514806
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Mobile apps are indispensable in people's daily lives, and automated GUI
(Graphical User Interface) testing is widely used for app quality assurance.
There is growing interest in using learning-based techniques for automated
GUI testing, which aim at generating human-like actions and interactions.
However, limitations such as low testing coverage, weak generalization, and
heavy reliance on training data create an urgent need for a more effective
approach to generating human-like actions that thoroughly test mobile apps.
Inspired by the success of Large Language Models (LLMs), e.g., GPT-3 and
ChatGPT, in natural language understanding and question answering, we formulate
the mobile GUI testing problem as a Q&A task. We propose GPTDroid, which asks
the LLM to chat with the mobile app: it passes GUI page information to the LLM
to elicit testing scripts, executes them, and keeps feeding the app's responses
back to the LLM, iterating the whole process. Within it, we extract the static
context of the GUI page and the dynamic context of the iterative testing
process, design prompts for inputting this information to the LLM, and develop
a neural matching network to decode the LLM's output into actionable steps to
execute on the app. We evaluate GPTDroid on 86 apps from Google Play: its
activity coverage is 71%, 32% higher than the best baseline, and it detects 36%
more bugs at a faster rate than the best baseline. GPTDroid also detects 48 new
bugs in apps on Google Play, 25 of which have been confirmed or fixed. We
further summarize the capabilities behind GPTDroid's superior performance,
including semantic text input, compound actions, long meaningful test traces,
and test case prioritization.
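To make the Q&A-style testing loop concrete, here is a minimal Python sketch of one GPTDroid-style iteration cycle. All names (extract_static_context, decode_to_action, the app and llm interfaces) are illustrative assumptions rather than the paper's implementation, and the naive substring matcher only stands in for the paper's neural matching network:

```python
from dataclasses import dataclass, field

@dataclass
class TestingContext:
    """Dynamic context accumulated across the iterative testing process."""
    history: list = field(default_factory=list)  # (prompt, reply, action) triples

def extract_static_context(gui_page: dict) -> str:
    """Flatten widget types and texts of the current GUI page into natural
    language; `gui_page` is assumed to be a parsed view hierarchy."""
    widgets = gui_page.get("widgets", [])
    return "; ".join(f"{w['type']} '{w.get('text', '')}'" for w in widgets)

def build_prompt(static_ctx: str, ctx: TestingContext) -> str:
    """Frame GUI testing as Q&A: describe the page, recap recent actions,
    and ask the LLM what to do next."""
    recent = "; ".join(action for _, _, action in ctx.history[-5:]) or "nothing yet"
    return (f"You are testing a mobile app. Current page widgets: {static_ctx}. "
            f"Recent actions: {recent}. What operation should be performed next?")

def decode_to_action(reply: str, gui_page: dict) -> dict:
    """Naive substring matcher standing in for the paper's neural matching
    network, which maps free-form LLM output to a concrete widget action."""
    for w in gui_page.get("widgets", []):
        if w.get("text") and w["text"].lower() in reply.lower():
            return {"op": "click", "target": w}
    return {"op": "back"}  # nothing matched: navigate away and continue

def test_app(app, llm, max_steps: int = 100) -> None:
    """One GPTDroid-style cycle per step: page -> prompt -> LLM -> action."""
    ctx = TestingContext()
    for _ in range(max_steps):
        page = app.dump_gui()            # assumed device/app interface
        prompt = build_prompt(extract_static_context(page), ctx)
        reply = llm(prompt)              # assumed LLM completion callable
        action = decode_to_action(reply, page)
        app.execute(action)              # assumed action executor
        ctx.history.append((prompt, reply, str(action)))
```

In the sketch, a short action history plays the role of the paper's dynamic context; the actual system also tracks richer signals such as tested functionalities.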
Related papers
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
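As a rough, self-contained sketch of the linear-probe idea in the entry above: a logistic-regression classifier trained to separate "clean" from "drifted" activation vectors. The random features below are a toy stand-in for the LLM hidden states the paper actually probes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for LLM activations: in the paper, features come from the
# model's hidden states before and after processing untrusted input.
rng = np.random.default_rng(0)
dim = 256
clean = rng.normal(0.0, 1.0, size=(500, dim))    # runs with no task drift
drifted = rng.normal(0.3, 1.0, size=(500, dim))  # runs with injected instructions
X = np.vstack([clean, drifted])
y = np.array([0] * 500 + [1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```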
- Large Language Models for Mobile GUI Text Input Generation: An Empirical Study [24.256184336154544]
Large Language Models (LLMs) have shown excellent text-generation capabilities.
This paper extensively investigates the effectiveness of nine state-of-the-art LLMs in Android text-input generation for UI pages.
arXiv Detail & Related papers (2024-04-13T09:56:50Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing [17.24045904273874]
We propose DroidAgent, an autonomous GUI testing agent for Android.
It is based on Large Language Models and support mechanisms such as long- and short-term memory.
DroidAgent achieved 61% activity coverage, compared to 51% for current state-of-the-art GUI testing techniques.
arXiv Detail & Related papers (2023-11-15T01:59:40Z)
- Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions [23.460051600514806]
GPTDroid is a Q&A-based GUI testing framework for mobile apps.
We introduce a functionality-aware memory prompting mechanism.
It outperforms the best baseline by 32% in activity coverage, and detects 31% more bugs at a faster rate.
arXiv Detail & Related papers (2023-10-24T12:30:26Z)
- Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model [23.460051600514806]
This paper proposes InputBlaster to automatically generate unusual text inputs for mobile app crash detection.
It formulates the unusual inputs generation problem as a task of producing a set of test generators, each of which can yield a batch of unusual text inputs.
It is evaluated on 36 text input widgets with crash bugs involving 31 popular Android apps, and results show that it achieves a 78% bug detection rate, 136% higher than the best baseline.
arXiv Detail & Related papers (2023-10-24T09:10:51Z)
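A minimal sketch of the "test generator" formulation in the InputBlaster entry above: each generator is a callable built around one mutation strategy and yields a batch of unusual text inputs. The strategies shown are illustrative guesses; in the paper they are produced with an LLM:

```python
import random
import string

def make_generator(mutation):
    """A 'test generator' in the InputBlaster sense: a callable that
    yields a batch of unusual text inputs from one mutation strategy."""
    def generate(batch_size=10):
        return [mutation() for _ in range(batch_size)]
    return generate

# Hypothetical mutation strategies (the paper derives these with an LLM).
mutations = [
    lambda: "\u202e" + "".join(random.choices(string.ascii_letters, k=8)),    # RTL-override prefix
    lambda: "9" * random.choice([64, 256, 1024]),                             # overlong numeric string
    lambda: "".join(chr(random.randint(0x1F600, 0x1F64F)) for _ in range(5)), # emoji run
]

generators = [make_generator(m) for m in mutations]
unusual_inputs = [text for g in generators for text in g(batch_size=3)]
```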
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
- DroidBot-GPT: GPT-powered UI Automation for Android [11.980924738484994]
DroidBot-GPT is a tool that utilizes GPT-like large language models (LLMs) to automate the interactions with Android mobile applications.
Given a natural language description of a desired task, DroidBot-GPT can automatically generate and execute actions that navigate the app to complete the task.
arXiv Detail & Related papers (2023-04-14T11:31:56Z)
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z)
- MATE: Masked Autoencoders are Online 3D Test-Time Learners [63.3907730920114]
MATE is the first Test-Time-Training (TTT) method designed for 3D data.
It makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data.
arXiv Detail & Related papers (2022-11-21T13:19:08Z)