Related papers: LELANTE: LEveraging LLM for Automated ANdroid TEsting

LELANTE: LEveraging LLM for Automated ANdroid TEsting

URL: http://arxiv.org/abs/2504.20896v1
Date: Tue, 29 Apr 2025 16:13:49 GMT
Title: LELANTE: LEveraging LLM for Automated ANdroid TEsting
Authors: Shamit Fatin, Mehbubul Hasan Al-Quvi, Haz Sameen Shahgir, Sukarna Barua, Anindya Iqbal, Sadia Sharmin, Md. Mostofa Akbar, Kallol Kumar Pal, A. Asif Al Rashid,
Abstract summary: Existing testing approaches require developers to manually write scripts using tools such as Appium and Espresso to execute the corresponding test case.<n>We introduce LELANTE, a novel framework that utilizes large language models (LLMs) to automate test case execution without requiring pre-written scripts.<n>In experiments across 390 test cases spanning 10 popular Android applications, LELANTE achieved a 73% test execution success rate.
Score: 6.112769800569302
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Given natural language test case description for an Android application, existing testing approaches require developers to manually write scripts using tools such as Appium and Espresso to execute the corresponding test case. This process is labor-intensive and demands significant effort to maintain as UI interfaces evolve throughout development. In this work, we introduce LELANTE, a novel framework that utilizes large language models (LLMs) to automate test case execution without requiring pre-written scripts. LELANTE interprets natural language test case descriptions, iteratively generate action plans, and perform the actions directly on the Android screen using its GUI. LELANTE employs a screen refinement process to enhance LLM interpretability, constructs a structured prompt for LLMs, and implements an action generation mechanism based on chain-of-thought reasoning of LLMs. To further reduce computational cost and enhance scalability, LELANTE utilizes model distillation using a foundational LLM. In experiments across 390 test cases spanning 10 popular Android applications, LELANTE achieved a 73% test execution success rate. Our results demonstrate that LLMs can effectively bridge the gap between natural language test case description and automated execution, making mobile testing more scalable and adaptable.

Related papers

LLM-AutoDiff: Auto-Differentiate Any LLM Workflow [58.56731133392544]
We introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering (APE) LLMs-AutoDiff treats each textual input as a trainable parameter and uses a frozen backward engine to generate feedback-akin to textual gradients. It consistently outperforms existing textual gradient baselines in both accuracy and training cost.
arXiv Detail & Related papers (2025-01-28T03:18:48Z)
Test-driven Software Experimentation with LASSO: an LLM Benchmarking Example [1.4685355149711299]
Test-Driven Software Experiments ( TDSEs) are experiments that involve the execution of software subjects and the observation and analysis of their "de facto" run-time behavior. We present a general-purpose analysis platform called LASSO that provides a minimal set of domain-specific languages and data structures to conduct TDSEs.
arXiv Detail & Related papers (2024-10-11T15:32:48Z)
SimulBench: Evaluating Language Models with Creative Simulation Tasks [20.233111652638637]
We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI.
arXiv Detail & Related papers (2024-09-11T21:53:20Z)
Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. We evaluate the performance of LLMs tool-use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
Get my drift? Catching LLM Task Drift with Activation Deltas [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.<n>We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.<n>We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
Intelligent Virtual Assistants with LLM-based Process Automation [31.275267197246595]
This paper proposes a novel LLM-based virtual assistant that can automatically perform multi-step operations within mobile apps based on high-level user requests. The system represents an advance in assistants by providing an end-to-end solution for parsing instructions, reasoning about goals, and executing actions.
arXiv Detail & Related papers (2023-12-04T07:51:58Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
LLM for Test Script Generation and Migration: Challenges, Capabilities, and Opportunities [8.504639288314063]
Test script generation is a vital component of software testing, enabling efficient and reliable automation of repetitive test tasks. Existing generation approaches often encounter limitations, such as difficulties in accurately capturing and reproducing test scripts across diverse devices, platforms, and applications. This paper investigates the application of large language models (LLM) in the domain of mobile application test script generation.
arXiv Detail & Related papers (2023-09-24T07:58:57Z)
AskIt: Unified Programming Interface for Programming with Large Language Models [0.0]
Large Language Models (LLMs) exhibit a unique phenomenon known as emergent abilities, demonstrating adeptness across numerous tasks. This paper introduces AskIt, a domain-specific language specifically designed for LLMs. Across 50 tasks, AskIt generated concise prompts, achieving a 16.14 % reduction in prompt length compared to benchmarks.
arXiv Detail & Related papers (2023-08-29T21:44:27Z)
Low-code LLM: Graphical User Interface over Large Language Models [115.08718239772107]
This paper introduces a novel human-LLM interaction framework, Low-code LLM. It incorporates six types of simple low-code visual programming interactions to achieve more controllable and stable responses. We highlight three advantages of the low-code LLM: user-friendly interaction, controllable generation, and wide applicability.
arXiv Detail & Related papers (2023-04-17T09:27:40Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.