Related papers: Seeing is Believing: Vision-driven Non-crash Functional Bug Detection for Mobile Apps

Seeing is Believing: Vision-driven Non-crash Functional Bug Detection for Mobile Apps

URL: http://arxiv.org/abs/2407.03037v2
Date: Wed, 04 Dec 2024 08:00:42 GMT
Title: Seeing is Believing: Vision-driven Non-crash Functional Bug Detection for Mobile Apps
Authors: Zhe Liu, Cheng Li, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Yawen Wang, Jun Hu, Qing Wang,
Abstract summary: This paper proposes a novel vision-driven, multi-agent collaborative automated GUI testing approach for detecting non-crash functional bugs.<n>We evaluate Trident on 590 non-crash bugs and compare it with 12 baselines, it can achieve more than 14%-112% and 108%-147% boost in average recall and precision.
Score: 26.96558418166514
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Mobile app GUI (Graphical User Interface) pages now contain rich visual information, with the visual semantics of each page helping users understand the application logic. However, these complex visual and functional logic present new challenges to software testing. Existing automated GUI testing methods, constrained by the lack of reliable testing oracles, are limited to detecting crash bugs with obvious abnormal signals. Consequently, many non-crash functional bugs, ranging from unexpected behaviors to logical errors, often evade detection by current techniques. While these non-crash functional bugs can exhibit visual cues that serve as potential testing oracles, they often entail a sequence of screenshots, and detecting them necessitates an understanding of the operational logic among GUI page transitions, which is challenging traditional techniques. Considering the remarkable performance of Multimodal Large Language Models (MLLM) in visual and language understanding, this paper proposes Trident, a novel vision-driven, multi-agent collaborative automated GUI testing approach for detecting non-crash functional bugs. It comprises three agents: Explorer, Monitor, and Detector, to guide the exploration, oversee the testing progress, and spot issues. We also address several challenges, i.e., align visual and textual information for MLLM input, achieve functionality-oriented exploration, and infer test oracles for non-crash bugs, to enhance the performance of functionality bug detection. We evaluate Trident on 590 non-crash bugs and compare it with 12 baselines, it can achieve more than 14%-112% and 108%-147% boost in average recall and precision compared with the best baseline. The ablation study further proves the contribution of each module. Moreover, Trident identifies 43 new bugs on Google Play, of which 31 have been fixed.

Related papers

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models [53.55128042938329]
Forensics-Bench is a new forgery detection evaluation benchmark suite. It comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet.
arXiv Detail & Related papers (2025-03-19T09:21:44Z)
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection [56.66677293607114]
We propose Code-as-Monitor (CaM) for both open-set reactive and proactive failure detection. To enhance the accuracy and efficiency of monitoring, we introduce constraint elements that abstract constraint-related entities. Experiments show that CaM achieves a 28.7% higher success rate and reduces execution time by 31.8% under severe disturbances.
arXiv Detail & Related papers (2024-12-05T18:58:27Z)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations. ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
Leveraging Large Vision Language Model For Better Automatic Web GUI Testing [7.480576630392405]
This paper proposes VETL, the first LVLM-driven endtoend web testing technique. With LVLM's scene understanding capabilities, VETL can generate valid and meaningful text inputs focusing on the local context. The selection of associated GUI elements is formulated as a visual question-answering problem, allowing LVLM to capture the logical connection between the input box and the relevant element.
arXiv Detail & Related papers (2024-10-16T01:37:58Z)
GUI Action Narrator: Where and When Did That Action Take Place? [19.344324166716245]
We develop a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning. We introduce our GUI action dataset textbfAct2Cap as well as a simple yet effective framework, textbfGUI Narrator, for GUI video captioning.
arXiv Detail & Related papers (2024-06-19T17:22:11Z)
VDebugger: Harnessing Execution Feedback for Debugging Visual Programs [103.61860743476933]
We introduce V Debugger, a critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. V Debugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. Evaluations on six datasets demonstrate V Debugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy.
arXiv Detail & Related papers (2024-06-19T11:09:16Z)
GUICourse: From General Vision Language Models to Versatile GUI Agents [75.5150601913659]
We contribute GUICourse, a suite of datasets to train visual-based GUI agents. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions.
arXiv Detail & Related papers (2024-06-17T08:30:55Z)
GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents [73.9254861755974]
This paper introduces a new dataset, called GUI-World, which features meticulously crafted Human-MLLM annotations. We evaluate the capabilities of current state-of-the-art MLLMs, including ImageLLMs and VideoLLMs, in understanding various types of GUI content.
arXiv Detail & Related papers (2024-06-16T06:56:53Z)
Artificial intelligence for context-aware visual change detection in software test automation [5.174422378856116]
We introduce a novel graph-based method for visual change detection in software test automation. Our method accurately identifies UI controls from software screenshots and constructs a graph representing contextual and spatial relationships between the controls. It can accurately detect visual software changes in various simple and complex test scenarios.
arXiv Detail & Related papers (2024-05-01T21:22:33Z)
CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation [61.68049335444254]
Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP) With our technical design, our agent achieves new state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios.
arXiv Detail & Related papers (2024-02-19T08:29:03Z)
SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models [61.8876114116716]
Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-related tasks. However, their ability to detect subtle visual spoofing and forgery clues in face attack detection tasks remains underexplored. We introduce a benchmark, SHIELD, to evaluate MLLMs for face spoofing and forgery detection.
arXiv Detail & Related papers (2024-02-06T17:31:36Z)
Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions [23.460051600514806]
GPTDroid is a Q&A-based GUI testing framework for mobile apps. We introduce a functionality-aware memory prompting mechanism. It outperforms the best baseline by 32% in activity coverage, and detects 31% more bugs at a faster rate.
arXiv Detail & Related papers (2023-10-24T12:30:26Z)
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now face an increasing risk of containing factual errors when handled by generative models. We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z)
Prompting Is All You Need: Automated Android Bug Replay with Large Language Models [28.69675481931385]
We propose AdbGPT, a new lightweight approach to automatically reproduce the bugs from bug reports through prompt engineering. AdbGPT leverages few-shot learning and chain-of-thought reasoning to elicit human knowledge and logical reasoning from LLMs. Our evaluations demonstrate the effectiveness and efficiency of our AdbGPT to reproduce 81.3% of bug reports in 253.6 seconds.
arXiv Detail & Related papers (2023-06-03T03:03:52Z)
Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing [23.460051600514806]
We propose GPTDroid, asking Large Language Model to chat with the mobile apps by passing the GUI page information to LLM to elicit testing scripts. Within it, we extract the static context of the GUI page and the dynamic context of the iterative testing process. We evaluate GPTDroid on 86 apps from Google Play, and its activity coverage is 71%, with 32% higher than the best baseline, and can detect 36% more bugs with faster speed than the best baseline.
arXiv Detail & Related papers (2023-05-16T13:46:52Z)
ADPTriage: Approximate Dynamic Programming for Bug Triage [0.0]
We develop a Markov decision process (MDP) model for an online bug triage task. We provide an ADP-based bug triage solution, called ADPTriage, which reflects downstream uncertainty in the bug arrivals and developers' timetables. Our result shows a significant improvement over the myopic approach in terms of assignment accuracy and fixing time.
arXiv Detail & Related papers (2022-11-02T04:42:21Z)
Infrared: A Meta Bug Detector [10.541969253100815]
We propose a new approach, called meta bug detection, which offers three crucial advantages over existing learning-based bug detectors. Our evaluation shows our meta bug detector (MBD) is effective in catching a variety of bugs including null pointer dereference, array index out-of-bound, file handle leak, and even data races in concurrent programs.
arXiv Detail & Related papers (2022-09-18T09:08:51Z)
Continual Object Detection via Prototypical Task Correlation Guided Gating Mechanism [120.1998866178014]
We present a flexible framework for continual object detection via pRotOtypical taSk corrElaTion guided gaTingAnism (ROSETTA) Concretely, a unified framework is shared by all tasks while task-aware gates are introduced to automatically select sub-models for specific tasks. Experiments on COCO-VOC, KITTI-Kitchen, class-incremental detection on VOC and sequential learning of four tasks show that ROSETTA yields state-of-the-art performance.
arXiv Detail & Related papers (2022-05-06T07:31:28Z)
Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation. In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.