VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps
- URL: http://arxiv.org/abs/2504.11675v1
- Date: Wed, 16 Apr 2025 00:19:31 GMT
- Title: VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps
- Authors: Biniam Fisseha Demissie, Yan Naing Tun, Lwin Khin Shar, Mariano Ceccato
- Abstract summary: Testing Android apps effectively requires a systematic exploration of the app's possible states. We propose a novel, automated fuzzing approach called VLM-Fuzz for effective UI testing of Android apps.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Testing Android apps effectively requires a systematic exploration of the app's possible states by simulating user interactions and system events. While existing approaches have proposed several fuzzing techniques to generate various text inputs and trigger user and system events for UI state exploration, achieving high code coverage remains a significant challenge in Android app testing. The main challenges are (1) reasoning about the complex and dynamic layout of UI screens; (2) generating required inputs/events to deal with certain widgets like pop-ups; and (3) coordinating current test inputs with previous inputs to avoid getting stuck in the same UI screen without improving test coverage. To address these problems, we propose a novel, automated fuzzing approach called VLM-Fuzz for effective UI testing of Android apps. We present a novel heuristic-based depth-first search (DFS) exploration algorithm, assisted by a vision language model (VLM), to effectively explore the UI states of the app. We use static analysis of the Android Manifest file and the runtime UI hierarchy XML to extract the list of components, intent-filters, and interactive UI widgets. The VLM is used to reason about complex UI layouts and widgets on demand. Based on the inputs from static analysis, the VLM, and the current UI state, we apply heuristics to address the above challenges. We evaluated VLM-Fuzz on a benchmark of 59 apps obtained from a recent work and compared it against two state-of-the-art approaches: APE and DeepGUI. VLM-Fuzz outperforms the best baseline by 9.0%, 3.7%, and 2.1% in terms of class coverage, method coverage, and line coverage, respectively. We also ran VLM-Fuzz on 80 recent Google Play apps (i.e., updated in 2024). VLM-Fuzz detected 208 unique crashes in 24 apps, which we have reported to the respective developers.
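To make the exploration loop concrete, below is a minimal Python sketch of a heuristic depth-first search over UI states that falls back to a VLM on demand. The mock app model, `ask_vlm`, and all other names are illustrative stand-ins under assumed behavior, not VLM-Fuzz's actual implementation.

```python
# Minimal sketch (not VLM-Fuzz's code) of heuristic-based DFS UI exploration
# with on-demand VLM queries, driven over an in-memory mock app.

# Mock UI model: each screen maps interactive widget ids to target screens.
APP = {
    "home":     {"btn_settings": "settings", "dialog_rate_us": "home"},
    "settings": {"btn_about": "about"},
    "about":    {},
}

def ask_vlm(screen, widget):
    """Stand-in for an on-demand VLM call that decides how to handle a
    widget the heuristics cannot classify (e.g., a pop-up dialog)."""
    return f"dismiss '{widget}' on '{screen}'"

def dfs_explore(screen, visited):
    if screen in visited:   # coordination heuristic: don't get stuck
        return              # re-exploring an already-covered screen
    visited.add(screen)
    print("exploring:", screen)
    for widget, target in APP[screen].items():
        # Simple widgets are handled by heuristics; complex ones go to the VLM.
        if not widget.startswith("btn_"):
            print("VLM decision:", ask_vlm(screen, widget))
        dfs_explore(target, visited)  # descend depth-first
        # A real harness would press "back" here to return to `screen`.

dfs_explore("home", set())
```

In the real system, states would be derived from the runtime UI hierarchy XML and actions replayed on a device, with a "back" press used to backtrack between sibling states.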
Related papers
- Visual Test-time Scaling for GUI Agent Grounding
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents.
Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy.
We observe significant performance gains of over 28% on the ScreenSpot-Pro and over 24% on the WebVoyager benchmarks.
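As a rough illustration of the zoom-in idea above, the sketch below crops and upscales a candidate region before re-grounding, then maps the predicted point back to full-screen coordinates; `ground` is a hypothetical stand-in for the VLM grounding call, not RegionFocus's API.

```python
# Illustrative sketch of visual test-time scaling: zoom into a candidate
# region, re-ground on the crop, and map the point back to screen coords.
from PIL import Image

def ground(image):
    # Stand-in: a real VLM would return the (x, y) of the target element.
    return (image.width // 2, image.height // 2)

def region_focus(screenshot, region, zoom=3):
    left, top, right, bottom = region
    # Crop and upscale the region of interest to cut background clutter.
    crop = screenshot.crop(region).resize(
        ((right - left) * zoom, (bottom - top) * zoom))
    x, y = ground(crop)
    # Map the prediction on the zoomed crop back to full-screen coordinates.
    return (left + x // zoom, top + y // zoom)

screen = Image.new("RGB", (1080, 1920))
print(region_focus(screen, (100, 200, 500, 600)))  # -> (300, 400)
```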
arXiv Detail & Related papers (2025-05-01T17:45:59Z)
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
We propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing the UI content changes before and after simulated interactions with specific UI elements. We construct the AutoGUI-704k dataset using the proposed pipeline, featuring multi-resolution, multi-device screenshots, diverse data domains, and detailed functionality annotations not provided by previous datasets.
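A toy sketch of the before/after comparison described above; the prompt format and helper names are assumptions for illustration, not AutoGUI's actual pipeline.

```python
# Toy sketch of inferring element functionality from UI changes around a
# simulated interaction (prompt format and names are illustrative only).

def functionality_prompt(element, ui_before, ui_after):
    return (
        f"UI content before interacting with '{element}':\n{ui_before}\n\n"
        f"UI content after interacting with '{element}':\n{ui_after}\n\n"
        "Infer and describe this element's functionality in one sentence."
    )

before = "Inbox (3) | Compose | Search"
after = "New message | To: | Subject: | Body"
print(functionality_prompt("Compose", before, after))
```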
arXiv Detail & Related papers (2025-02-04T03:39:59Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents
This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Beyond Pass or Fail: Multi-Dimensional Benchmarking of Foundation Models for Goal-based Mobile UI Navigation
We propose Sphinx, a benchmark for evaluating foundation models (FMs) in industrial settings for user interface (UI) navigation. We evaluate 8 FMs with 20 different configurations using both Google Play applications and WeChat's internal UI test cases. Our results show that existing FMs universally struggle with goal-based testing tasks, primarily due to insufficient UI-specific capabilities.
arXiv Detail & Related papers (2025-01-06T09:10:11Z)
- Model-Enhanced LLM-Driven VUI Testing of VPA Apps
We introduce Elevate, a model-enhanced large language model (LLM)-driven VUI testing framework.
It is benchmarked on 4,000 real-world Alexa skills, against the state-of-the-art tester Vitas.
It achieves 15% higher state-space coverage than Vitas across all app types and is significantly more efficient.
arXiv Detail & Related papers (2024-07-03T03:36:05Z)
- Tell Me What's Next: Textual Foresight for Generic UI Representations
We propose Textual Foresight, a novel pretraining objective for learning UI screen representations.
Textual Foresight generates global text descriptions of future UI states given a current UI and local action taken.
We train on our newly constructed mobile app dataset, OpenApp, the first public dataset for app UI representation learning.
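The sketch below illustrates how a Textual Foresight-style training pair might be assembled from a current UI, a local action, and a caption of the next state; the names and format are assumptions, not the paper's code.

```python
# Illustrative construction of a Textual Foresight-style training pair:
# input = current UI plus a local action; target = a global text description
# of the *next* UI state.

def foresight_pair(current_ui, action, next_ui_caption):
    prompt = (
        f"Current UI: {current_ui}\n"
        f"Action taken: {action}\n"
        "Describe the UI state that follows:"
    )
    return {"input": prompt, "target": next_ui_caption}

pair = foresight_pair(
    "Login screen with username/password fields and a 'Sign in' button",
    "tap 'Sign in'",
    "Home feed listing recent posts with a bottom navigation bar",
)
print(pair["input"], "\n->", pair["target"])
```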
arXiv Detail & Related papers (2024-06-12T02:43:19Z)
- Large Language Models for Mobile GUI Text Input Generation: An Empirical Study
Large Language Models (LLMs) have demonstrated excellent text-generation capabilities. This paper extensively investigates the effectiveness of nine state-of-the-art LLMs in Android text-input generation for UI pages.
arXiv Detail & Related papers (2024-04-13T09:56:50Z)
- Make LLM a Testing Expert: Bringing Human-like Interaction to Mobile GUI Testing via Functionality-aware Decisions
GPTDroid is a Q&A-based GUI testing framework for mobile apps.
We introduce a functionality-aware memory prompting mechanism.
It outperforms the best baseline by 32% in activity coverage, and detects 31% more bugs at a faster rate.
arXiv Detail & Related papers (2023-10-24T12:30:26Z)
- ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations
Multimodal Vision-Language Models (VLMs) enable powerful applications through their fused understanding of images and language. We adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM).
We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks.
arXiv Detail & Related papers (2023-10-07T16:32:34Z)
- Multi-modal Queried Object Detection in the Wild
MQ-Det is an efficient architecture and pre-training strategy for real-world object detection.
It incorporates vision queries into existing language-queried-only detectors.
MQ-Det's simple yet effective architecture and training strategy are compatible with most language-queried object detectors.
arXiv Detail & Related papers (2023-05-30T12:24:38Z)
- Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing
We propose GPTDroid, which asks a Large Language Model to chat with mobile apps by passing GUI page information to the LLM to elicit testing scripts.
Within it, we extract the static context of the GUI page and the dynamic context of the iterative testing process.
We evaluate GPTDroid on 86 apps from Google Play; it achieves 71% activity coverage, 32% higher than the best baseline, and detects 36% more bugs faster than the best baseline.
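A toy sketch of combining static GUI context with dynamic testing context in a single prompt, as described above; the prompt format and names are assumptions, not GPTDroid's actual template.

```python
# Toy sketch of prompting an LLM with static GUI context plus dynamic
# testing context (illustrative format only).

def build_prompt(static_ctx, dynamic_ctx):
    return (
        f"GUI page (static context): {static_ctx}\n"
        f"Testing history (dynamic context): {dynamic_ctx}\n"
        "Output the next test action as a script step."
    )

static_ctx = "LoginActivity: EditText(username), EditText(password), Button(Sign in)"
dynamic_ctx = "entered 'alice' into username; password field still empty"
print(build_prompt(static_ctx, dynamic_ctx))
```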
arXiv Detail & Related papers (2023-05-16T13:46:52Z)
- Emerging App Issue Identification via Online Joint Sentiment-Topic Tracing
We propose a novel emerging issue detection approach named MERIT.
Based on the AOBST model, we infer the topics negatively reflected in user reviews for one app version.
Experiments on popular apps from Google Play and Apple's App Store demonstrate the effectiveness of MERIT.
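As a greatly simplified illustration of emerging-issue detection (a frequency heuristic, not the AOBST model), the sketch below flags topics whose negative-review counts jump between app versions.

```python
# Simplified illustration: flag topics whose share of negative reviews
# jumps in the new app version relative to the previous one.
from collections import Counter

def emerging_issues(prev_neg_topics, curr_neg_topics, ratio=2.0, min_count=5):
    prev, curr = Counter(prev_neg_topics), Counter(curr_neg_topics)
    return [t for t, c in curr.items()
            if c >= min_count and c > ratio * (prev.get(t, 0) or 1)]

prev = ["battery"] * 4 + ["crash"] * 2
curr = ["battery"] * 5 + ["crash"] * 12 + ["login"] * 6
print(emerging_issues(prev, curr))  # -> ['crash', 'login']
```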
arXiv Detail & Related papers (2020-08-23T06:34:05Z)