Related papers: Large Language Models for Mobile GUI Text Input Generation: An Empirical Study

Large Language Models for Mobile GUI Text Input Generation: An Empirical Study

URL: http://arxiv.org/abs/2404.08948v1
Date: Sat, 13 Apr 2024 09:56:50 GMT
Title: Large Language Models for Mobile GUI Text Input Generation: An Empirical Study
Authors: Chenhui Cui, Tao Li, Junjie Wang, Chunyang Chen, Dave Towey, Rubing Huang,
Abstract summary: Large Language Models (LLMs) have shown excellent text-generation capabilities. This paper extensively investigates the effectiveness of nine state-of-the-art LLMs in Android text-input generation for UI pages.
Score: 24.256184336154544
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mobile applications (apps) have become an essential part of our daily lives, making ensuring their quality an important activity. GUI testing, a quality assurance method, has frequently been used for mobile apps. When conducting GUI testing, it is important to generate effective text inputs for the text-input components. Some GUIs require these text inputs to move from one page to the next, which remains a challenge to achieving complete UI exploration. Recently, Large Language Models (LLMs) have shown excellent text-generation capabilities. Among the LLMs, OpenAI's GPT series has been widely discussed and used. However, it may not be possible to use these LLMs for GUI testing actual mobile apps, due to the security and privacy issues related to the production data. Therefore, it is necessary to explore the potential of different LLMs to guide text-input generation in mobile GUI testing. This paper reports on a large-scale empirical study that extensively investigates the effectiveness of nine state-of-the-art LLMs in Android text-input generation for UI pages. We collected 114 UI pages from 62 open-source Android apps and extracted contextual information from the UI pages to construct prompts for LLMs to generate text inputs. The experimental results show that some LLMs can generate relatively more effective and higher-quality text inputs, achieving a 50.58% to 66.67% page-pass-through rate, and even detecting some real bugs in open-source apps. Compared with the GPT-3.5 and GPT-4 LLMs, other LLMs reduce the page-pass-through rates by 17.97% to 84.79% and 21.93% to 85.53%, respectively. We also found that using more complete UI contextual information can increase the page-pass-through rates of LLMs for generating text inputs. In addition, we also describe six insights gained regarding the use of LLMs for Android testing: These insights will benefit the Android testing community.

Related papers

GigaCheck: Detecting LLM-generated Content [72.27323884094953]
In this work, we investigate the task of generated text detection by proposing the GigaCheck. Our research explores two approaches: (i) distinguishing human-written texts from LLM-generated ones, and (ii) detecting LLM-generated intervals in Human-Machine collaborative texts. Specifically, we use a fine-tuned general-purpose LLM in conjunction with a DETR-like detection model, adapted from computer vision, to localize AI-generated intervals within text.
arXiv Detail & Related papers (2024-10-31T08:30:55Z)
SMLT-MUGC: Small, Medium, and Large Texts -- Machine versus User-Generated Content Detection and Comparison [2.7147912878168303]
We compare the performance of machine learning algorithms on four datasets: (1) small (tweets from Election, FIFA, and Game of Thrones), (2) medium (Wikipedia introductions and PubMed abstracts), and (3) large (OpenAI web text dataset) Our results indicate that LLMs with very large parameters (such as the XL-1542 variant of GPT2 with 1542 million parameters) were harder to detect using traditional machine learning methods. We examine the characteristics of human and machine-generated texts across multiple dimensions, including linguistics, personality, sentiment, bias, and morality.
arXiv Detail & Related papers (2024-06-28T22:19:01Z)
ReMoDetect: Reward Models Recognize Aligned LLM's Generations [55.06804460642062]
Large language models (LLMs) generate human-preferable texts. In this paper, we identify the common characteristics shared by these models. We propose two training schemes to further improve the detection ability of the reward model.
arXiv Detail & Related papers (2024-05-27T17:38:33Z)
Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore [51.65730053591696]
We propose a simple yet effective black-box zero-shot detection approach based on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts. Experimental results show that our method outperforms current state-of-the-art (SOTA) zero-shot and supervised methods.
arXiv Detail & Related papers (2024-05-07T12:57:01Z)
Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG. InFO-RAG is low-cost and general across various tasks. It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z)
Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions [23.460051600514806]
GPTDroid is a Q&A-based GUI testing framework for mobile apps. We introduce a functionality-aware memory prompting mechanism. It outperforms the best baseline by 32% in activity coverage, and detects 31% more bugs at a faster rate.
arXiv Detail & Related papers (2023-10-24T12:30:26Z)
Testing the Limits: Unusual Text Inputs Generation for Mobile App Crash Detection with Large Language Model [23.460051600514806]
This paper proposes InputBlaster to automatically generate unusual text inputs for mobile app crash detection. It formulates the unusual inputs generation problem as a task of producing a set of test generators, each of which can yield a batch of unusual text inputs. It is evaluated on 36 text input widgets with cash bugs involving 31 popular Android apps, and results show that it achieves 78% bug detection rate, with 136% higher than the best baseline.
arXiv Detail & Related papers (2023-10-24T09:10:51Z)
Detecting LLM-Generated Text in Computing Education: A Comparative Study for ChatGPT Cases [0.0]
Large Language Models (LLMs) have posed a serious threat to academic integrity in education. Modern detectors are still in need of improvements so that they can offer a full-proof solution to help maintain academic integrity.
arXiv Detail & Related papers (2023-07-10T12:18:34Z)
Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning [5.822010906632045]
This paper studies the performance of open-source Large Language Models (LLMs) in text classification tasks typical for political science research. By examining tasks like stance, topic, and relevance classification, we aim to guide scholars in making informed decisions about their use of LLMs for text analysis.
arXiv Detail & Related papers (2023-07-05T10:15:07Z)
Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks. Our method achieves state-of-the-art results on well-established TAG datasets. Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
LLMDet: A Third Party Large Language Models Generated Text Detection Tool [119.0952092533317]
Large language models (LLMs) are remarkably close to high-quality human-authored text. Existing detection tools can only differentiate between machine-generated and human-authored text. We propose LLMDet, a model-specific, secure, efficient, and extendable detection tool.
arXiv Detail & Related papers (2023-05-24T10:45:16Z)
Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing [23.460051600514806]
We propose GPTDroid, asking Large Language Model to chat with the mobile apps by passing the GUI page information to LLM to elicit testing scripts. Within it, we extract the static context of the GUI page and the dynamic context of the iterative testing process. We evaluate GPTDroid on 86 apps from Google Play, and its activity coverage is 71%, with 32% higher than the best baseline, and can detect 36% more bugs with faster speed than the best baseline.
arXiv Detail & Related papers (2023-05-16T13:46:52Z)
A Survey of Pretrained Language Models Based Text Generation [97.64625999380425]
Text Generation aims to produce plausible and readable text in human language from input data. Deep learning has greatly advanced this field by neural generation models, especially the paradigm of pretrained language models (PLMs) Grounding text generation on PLMs is seen as a promising direction in both academia and industry.
arXiv Detail & Related papers (2022-01-14T01:44:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.