Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing
- URL: http://arxiv.org/abs/2305.09434v1
- Date: Tue, 16 May 2023 13:46:52 GMT
- Title: Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing
- Authors: Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, Qing Wang
- Abstract summary: We propose GPTDroid, which asks a Large Language Model (LLM) to chat with a mobile app by passing GUI page information to the LLM to elicit testing scripts.
Within it, we extract the static context of the GUI page and the dynamic context of the iterative testing process.
We evaluate GPTDroid on 86 apps from Google Play: its activity coverage is 71%, 32% higher than the best baseline, and it detects 36% more bugs at a faster rate than the best baseline.
- Score: 23.460051600514806
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Mobile apps are indispensable in people's daily lives, and automated GUI
(Graphical User Interface) testing is widely used for app quality assurance.
There is growing interest in using learning-based techniques for automated
GUI testing, which aim at generating human-like actions and interactions.
However, limitations such as low testing coverage, weak generalization, and
heavy reliance on training data create an urgent need for a more effective
approach to generating human-like actions that thoroughly test mobile apps.
Inspired by the success of Large Language Models (LLMs), e.g., GPT-3 and
ChatGPT, in natural language understanding and question answering, we formulate
the mobile GUI testing problem as a Q&A task. We propose GPTDroid, which asks
the LLM to chat with the mobile app: it passes GUI page information to the LLM
to elicit testing scripts, executes them, and keeps feeding the app's responses
back to the LLM, iterating the whole process. Within it, we extract the static
context of the GUI page and the dynamic context of the iterative testing
process, design prompts for inputting this information to the LLM, and develop
a neural matching network to decode the LLM's output into actionable steps to
execute on the app. We evaluate GPTDroid on 86 apps from Google Play: its
activity coverage is 71%, 32% higher than the best baseline, and it detects 36%
more bugs at a faster rate than the best baseline. GPTDroid also detects 48 new
bugs in apps on Google Play, 25 of which have been confirmed or fixed. We
further summarize the capabilities behind GPTDroid's superior performance,
including semantic text input, compound actions, long meaningful test traces,
and test case prioritization.
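To make the Q&A-style testing loop concrete, here is a minimal Python sketch of one GPTDroid-style iteration cycle. All names (extract_static_context, decode_to_action, the app and llm interfaces) are illustrative assumptions rather than the paper's implementation, and the naive substring matcher only stands in for the paper's neural matching network:

```python
from dataclasses import dataclass, field

@dataclass
class TestingContext:
    """Dynamic context accumulated across the iterative testing process."""
    history: list = field(default_factory=list)  # (prompt, reply, action) triples

def extract_static_context(gui_page: dict) -> str:
    """Flatten widget types and texts of the current GUI page into natural
    language; `gui_page` is assumed to be a parsed view hierarchy."""
    widgets = gui_page.get("widgets", [])
    return "; ".join(f"{w['type']} '{w.get('text', '')}'" for w in widgets)

def build_prompt(static_ctx: str, ctx: TestingContext) -> str:
    """Frame GUI testing as Q&A: describe the page, recap recent actions,
    and ask the LLM what to do next."""
    recent = "; ".join(action for _, _, action in ctx.history[-5:]) or "nothing yet"
    return (f"You are testing a mobile app. Current page widgets: {static_ctx}. "
            f"Recent actions: {recent}. What operation should be performed next?")

def decode_to_action(reply: str, gui_page: dict) -> dict:
    """Naive substring matcher standing in for the paper's neural matching
    network, which maps free-form LLM output to a concrete widget action."""
    for w in gui_page.get("widgets", []):
        if w.get("text") and w["text"].lower() in reply.lower():
            return {"op": "click", "target": w}
    return {"op": "back"}  # nothing matched: navigate away and continue

def test_app(app, llm, max_steps: int = 100) -> None:
    """One GPTDroid-style cycle per step: page -> prompt -> LLM -> action."""
    ctx = TestingContext()
    for _ in range(max_steps):
        page = app.dump_gui()            # assumed device/app interface
        prompt = build_prompt(extract_static_context(page), ctx)
        reply = llm(prompt)              # assumed LLM completion callable
        action = decode_to_action(reply, page)
        app.execute(action)              # assumed action executor
        ctx.history.append((prompt, reply, str(action)))
```

In the sketch, a short action history plays the role of the paper's dynamic context; the actual system also tracks richer signals such as tested functionalities.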
Related papers
- AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
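As a rough, self-contained sketch of the linear-probe idea in the entry above: a logistic-regression classifier trained to separate "clean" from "drifted" activation vectors. The random features below are a toy stand-in for the LLM hidden states the paper actually probes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for LLM activations: in the paper, features come from the
# model's hidden states before and after processing untrusted input.
rng = np.random.default_rng(0)
dim = 256
clean = rng.normal(0.0, 1.0, size=(500, dim))    # runs with no task drift
drifted = rng.normal(0.3, 1.0, size=(500, dim))  # runs with injected instructions
X = np.vstack([clean, drifted])
y = np.array([0] * 500 + [1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```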
- Large Language Models for Mobile GUI Text Input Generation: An Empirical Study [24.256184336154544]
Large Language Models (LLMs) have shown excellent text-generation capabilities.
This paper extensively investigates the effectiveness of nine state-of-the-art LLMs in Android text-input generation for UI pages.
arXiv Detail & Related papers (2024-04-13T09:56:50Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing [17.24045904273874]
We propose DroidAgent, an autonomous GUI testing agent for Android.
It is based on Large Language Models and support mechanisms such as long- and short-term memory.
DroidAgent achieved 61% activity coverage, compared to 51% for current state-of-the-art GUI testing techniques.
arXiv Detail & Related papers (2023-11-15T01:59:40Z)
- Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions [23.460051600514806]
GPTDroid is a Q&A-based GUI testing framework for mobile apps.
We introduce a functionality-aware memory prompting mechanism.
It outperforms the best baseline by 32% in activity coverage, and detects 31% more bugs at a faster rate.
arXiv Detail & Related papers (2023-10-24T12:30:26Z)
- Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model [23.460051600514806]
This paper proposes InputBlaster to automatically generate unusual text inputs for mobile app crash detection.
It formulates the unusual inputs generation problem as a task of producing a set of test generators, each of which can yield a batch of unusual text inputs.
It is evaluated on 36 text input widgets with crash bugs involving 31 popular Android apps, and results show that it achieves a 78% bug detection rate, 136% higher than the best baseline.
arXiv Detail & Related papers (2023-10-24T09:10:51Z)
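A minimal sketch of the "test generator" formulation in the InputBlaster entry above: each generator is a callable built around one mutation strategy and yields a batch of unusual text inputs. The strategies shown are illustrative guesses; in the paper they are produced with an LLM:

```python
import random
import string

def make_generator(mutation):
    """A 'test generator' in the InputBlaster sense: a callable that
    yields a batch of unusual text inputs from one mutation strategy."""
    def generate(batch_size=10):
        return [mutation() for _ in range(batch_size)]
    return generate

# Hypothetical mutation strategies (the paper derives these with an LLM).
mutations = [
    lambda: "\u202e" + "".join(random.choices(string.ascii_letters, k=8)),    # RTL-override prefix
    lambda: "9" * random.choice([64, 256, 1024]),                             # overlong numeric string
    lambda: "".join(chr(random.randint(0x1F600, 0x1F64F)) for _ in range(5)), # emoji run
]

generators = [make_generator(m) for m in mutations]
unusual_inputs = [text for g in generators for text in g(batch_size=3)]
```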
- MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z)
- DroidBot-GPT: GPT-powered UI Automation for Android [11.980924738484994]
DroidBot-GPT is a tool that utilizes GPT-like large language models (LLMs) to automate the interactions with Android mobile applications.
Given a natural language description of a desired task, DroidBot-GPT can automatically generate and execute actions that navigate the app to complete the task.
arXiv Detail & Related papers (2023-04-14T11:31:56Z)
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z)
- MATE: Masked Autoencoders are Online 3D Test-Time Learners [63.3907730920114]
MATE is the first Test-Time-Training (TTT) method designed for 3D data.
It makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data.
arXiv Detail & Related papers (2022-11-21T13:19:08Z)