Can You Mimic Me? Exploring the Use of Android Record & Replay Tools in Debugging
- URL: http://arxiv.org/abs/2504.20237v1
- Date: Mon, 28 Apr 2025 20:15:59 GMT
- Title: Can You Mimic Me? Exploring the Use of Android Record & Replay Tools in Debugging
- Authors: Zihe Song, S M Hasan Mansur, Ravishka Rathnasuriya, Yumna Fatima, Wei Yang, Kevin Moran, Wing Lam
- Abstract summary: Record and replay (R&R) tools facilitate manual and automated UI testing by recording UI actions to execute test scenarios and replay bugs. We conduct an empirical study on using R&R tools to record and replay non-crashing failures, crashing bugs, and feature-based user scenarios. Results show that 17% of scenarios, 38% of non-crashing bugs, and 44% of crashing bugs cannot be reliably recorded and replayed.
- Score: 13.79592937352459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Android User Interface (UI) testing is a critical research area due to the ubiquity of apps and the challenges faced by developers. Record and replay (R&R) tools facilitate manual and automated UI testing by recording UI actions to execute test scenarios and replay bugs. These tools typically support (i) regression testing, (ii) non-crashing functional bug reproduction, and (iii) crashing bug reproduction. However, prior work only examines these tools in fragmented settings, lacking a comprehensive evaluation across common use cases. We address this gap by conducting an empirical study on using R&R tools to record and replay non-crashing failures, crashing bugs, and feature-based user scenarios, and explore combining R&R with automated input generation (AIG) tools to replay crashing bugs. Our study involves one industrial and three academic R&R tools, 34 scenarios from 17 apps, 90 non-crashing failures from 42 apps, and 31 crashing bugs from 17 apps. Results show that 17% of scenarios, 38% of non-crashing bugs, and 44% of crashing bugs cannot be reliably recorded and replayed, mainly due to action interval resolution, API incompatibility, and Android tooling limitations. Our findings highlight key future research directions to enhance the practical application of R&R tools.
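To make the record-and-replay workflow concrete, the sketch below replays a sequence of previously recorded UI actions through `adb shell input`, preserving the recorded delay before each action. This is a minimal, hypothetical illustration of the R&R concept only: the trace format, coordinates, and helper names are assumptions made for this example and do not correspond to any of the tools evaluated in the paper.

```python
import json
import subprocess
import time

# A recorded trace: each entry is one UI action plus the delay (seconds)
# observed since the previous action during recording. The schema, the
# coordinates, and the file name below are hypothetical and are not the
# format of any tool evaluated in the study.
EXAMPLE_TRACE = [
    {"delay": 0.0, "kind": "tap", "x": 540, "y": 1200},
    {"delay": 1.5, "kind": "text", "value": "hello"},
    {"delay": 0.8, "kind": "swipe",
     "x1": 540, "y1": 1600, "x2": 540, "y2": 400, "ms": 300},
]


def adb_input(*args):
    """Send one UI action to the connected device via `adb shell input`."""
    subprocess.run(["adb", "shell", "input", *args], check=True)


def replay(trace):
    """Replay a trace, preserving the recorded inter-action delays.

    Coarse delay handling is one source of replay flakiness ("action
    interval resolution"): firing an action too early can hit a widget
    that has not been rendered yet.
    """
    for step in trace:
        time.sleep(step["delay"])
        if step["kind"] == "tap":
            adb_input("tap", str(step["x"]), str(step["y"]))
        elif step["kind"] == "text":
            adb_input("text", step["value"])
        elif step["kind"] == "swipe":
            adb_input("swipe", str(step["x1"]), str(step["y1"]),
                      str(step["x2"]), str(step["y2"]), str(step["ms"]))


if __name__ == "__main__":
    # Persist and reload the trace to mimic a record-then-replay workflow.
    with open("trace.json", "w") as f:
        json.dump(EXAMPLE_TRACE, f, indent=2)
    with open("trace.json") as f:
        replay(json.load(f))
```

Production R&R tools typically record at a lower level (e.g., kernel input events or accessibility services), which is where issues such as action-interval resolution and API incompatibility, noted in the abstract, can arise.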
Related papers
- LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews [74.87393214734114]
This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories.
Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting.
Instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 points.
arXiv Detail & Related papers (2025-04-15T10:07:33Z) - Understanding and Detecting Compatibility Issues in Android Auto Apps [0.5908471365011941]
We study 147 reported issues related to Android Auto and identify their root causes. More than 70% of the issues result from UI incompatibilities, 24% from media playback errors, and around 5% from failures in voice command handling. We introduce CarCompat, a static analysis framework that detects compatibility problems in Android Auto apps.
arXiv Detail & Related papers (2025-03-06T01:37:02Z) - AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL [46.65963514391019]
AutoRestTest is a novel tool that integrates the Semantic Property Dependency Graph (SPDG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing.
arXiv Detail & Related papers (2025-01-15T05:54:33Z) - LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs. We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool. Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z) - Seeing is Believing: Vision-driven Non-crash Functional Bug Detection for Mobile Apps [26.96558418166514]
This paper proposes a novel vision-driven, multi-agent collaborative automated GUI testing approach for detecting non-crash functional bugs. We evaluate Trident on 590 non-crash bugs and compare it with 12 baselines; it achieves improvements of 14%-112% in average recall and 108%-147% in average precision.
arXiv Detail & Related papers (2024-07-03T11:58:09Z) - Leveraging Stack Traces for Spectrum-based Fault Localization in the Absence of Failing Tests [44.13331329339185]
We introduce a new approach, SBEST, that integrates stack trace data with test coverage to enhance fault localization (a generic sketch of this idea appears after this list).
Our approach shows a significant improvement, increasing Mean Average Precision (MAP) by 32.22% and Mean Reciprocal Rank (MRR) by 17.43% over traditional stack trace ranking methods.
arXiv Detail & Related papers (2024-05-01T15:15:52Z) - An Analysis of Bugs In Persistent Memory Application [0.0]
We evaluate an open-source automatic bug-detection tool (i.e., AGAMOTTO) on an NVM-level hashing PM application.
Our validation tool discovered 65 new NVM-level hashing bugs in the PMDK library.
We further propose a Deep Q-Learning search algorithm on top of the PM-Aware search algorithm to improve search efficiency.
arXiv Detail & Related papers (2023-07-19T23:12:01Z) - Prompting Is All You Need: Automated Android Bug Replay with Large Language Models [28.69675481931385]
We propose AdbGPT, a new lightweight approach to automatically reproduce the bugs from bug reports through prompt engineering.
AdbGPT leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs.
Our evaluations demonstrate the effectiveness and efficiency of AdbGPT, which reproduces 81.3% of bug reports in 253.6 seconds.
arXiv Detail & Related papers (2023-06-03T03:03:52Z) - Large Language Models are Few-shot Testers: Exploring LLM-based General Bug Reproduction [14.444294152595429]
The number of tests added in open source repositories due to issues was about 28% of the corresponding project test suite size.
We propose LIBRO, a framework that uses Large Language Models (LLMs), which have been shown to be capable of performing code-related tasks.
Our evaluation of LIBRO shows that, on the widely studied Defects4J benchmark, LIBRO can generate failure reproducing test cases for 33% of all studied cases.
arXiv Detail & Related papers (2022-09-23T10:50:47Z) - BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization.
We provide a general benchmark with a diversity of real and synthetic Java bugs.
We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z) - SUPERNOVA: Automating Test Selection and Defect Prevention in AAA Video Games Using Risk Based Testing and Machine Learning [62.997667081978825]
Testing video games is an increasingly difficult task as traditional methods fail to scale with growing software systems.
We present SUPERNOVA, a system responsible for test selection and defect prevention while also functioning as an automation hub.
The direct impact has been an observed reduction of 55% or more in testing hours for an undisclosed sports game title.
arXiv Detail & Related papers (2022-03-10T00:47:46Z)
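As a generic illustration of the stack-trace-assisted fault localization idea mentioned in the SBEST entry above (this is not SBEST's actual formula; the method names, coverage counts, and boost weight are invented for this example), the following sketch computes an Ochiai-style suspiciousness score from test coverage and then boosts methods that appear in a crash stack trace:

```python
import math


def ochiai(failed_cov, passed_cov, total_failed):
    """Ochiai suspiciousness for one code element.

    failed_cov:   1 if the element was executed by the failing run, else 0
    passed_cov:   number of passing tests that execute the element
    total_failed: total number of failing runs (1 in this example)
    """
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0


def rank_methods(coverage, stack_trace, boost=2.0):
    """Rank methods by suspiciousness, boosting those on the crash stack.

    `coverage` maps method -> (failed_cov, passed_cov); `boost` is an
    arbitrary illustrative weight, not a value taken from SBEST.
    """
    scores = {}
    for method, (failed_cov, passed_cov) in coverage.items():
        score = ochiai(failed_cov, passed_cov, total_failed=1)
        if method in stack_trace:
            score *= boost
        scores[method] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical coverage data and crash stack trace for a small app.
    coverage = {
        "Cart.addItem":  (1, 40),  # in the failing run and many passing tests
        "Cart.checkout": (1, 3),   # in the failing run and few passing tests
        "Session.renew": (0, 25),  # not executed by the failing run
    }
    stack_trace = {"Cart.checkout"}
    for method, score in rank_methods(coverage, stack_trace):
        print(f"{score:.3f}  {method}")
```

Running the example ranks `Cart.checkout` first because it is rarely executed by passing tests and sits on the crash stack.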
This list is automatically generated from the titles and abstracts of the papers in this site.