Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning
- URL: http://arxiv.org/abs/2504.15917v1
- Date: Tue, 22 Apr 2025 14:02:57 GMT
- Title: Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning
- Authors: Hieu Huynh, Hai Phung, Hao Pham, Tien N. Nguyen, Vu Nguyen
- Abstract summary: We introduce VisiDroid, a multi-modal, multi-agent framework that iteratively determines the next action and leverages visual images of screens to detect the task's completeness. Our evaluation shows that VisiDroid achieves an accuracy of 87.3%, outperforming the best baseline by a relative 23.5%.
- Score: 8.363126388041408
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In Android GUI testing, generating an action sequence for a task that can be replayed as a test script is common. Generating sequences of actions and the corresponding test scripts from task goals described in natural language can eliminate the need for manually writing test scripts. However, existing approaches based on large language models (LLMs) often struggle to identify the final action, and either end prematurely or continue past the final screen. In this paper, we introduce VisiDroid, a multi-modal, LLM-based, multi-agent framework that iteratively determines the next action and leverages visual images of screens to detect the task's completeness. The multi-modal approach enhances our model in two significant ways. First, it enables the model to avoid prematurely terminating a task when textual content alone gives misleading indications of task completion. Visual input also helps the tool avoid errors when changes in the GUI do not directly affect progress toward task completion, such as adjustments to font sizes or colors. Second, the multi-modal approach ensures the tool does not progress beyond the final screen, which might lack explicit textual indicators of completion but could display a visual element signaling that the task is done, as is common in GUI apps. Our evaluation shows that VisiDroid achieves an accuracy of 87.3%, outperforming the best baseline by a relative 23.5%. We also demonstrate that our multi-modal framework with images and texts enables the LLM to better determine when a task is completed.
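As a rough illustration of the iterative loop described above, the sketch below shows one way such an agent could alternate between a visual completion check and next-action prediction. This is a minimal sketch under stated assumptions, not the paper's implementation: the agent roles, the prompts, the function names (`propose_next_action`, `is_task_complete`), the `query_multimodal_llm` placeholder, and the `device` API are all hypothetical.

```python
# Hypothetical sketch of an iterative, multi-modal test-generation loop in the
# spirit of VisiDroid: an LLM proposes the next GUI action from the task goal,
# the action history, and the current screenshot, while a separate vision-based
# check decides whether the task is already complete. All names and APIs here
# are illustrative assumptions, not the paper's published interface.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Action:
    kind: str                    # e.g. "tap", "type", "scroll"
    target: str                  # UI element identifier or description
    text: Optional[str] = None   # text to enter, if any


@dataclass
class EpisodeState:
    task: str
    history: List[Action] = field(default_factory=list)


def query_multimodal_llm(prompt: str, screenshot_png: bytes) -> str:
    """Placeholder for a call to a multi-modal LLM (text + image input)."""
    raise NotImplementedError


def propose_next_action(state: EpisodeState, screenshot_png: bytes) -> Action:
    """Ask the LLM for the next action given the goal, history, and screen."""
    prompt = (
        f"Task: {state.task}\n"
        f"Actions so far: {[a.kind + ' ' + a.target for a in state.history]}\n"
        "Given the attached screenshot, return the single next GUI action "
        "as 'kind|target|text'."
    )
    raw = query_multimodal_llm(prompt, screenshot_png)
    # Parsing is app-specific; here we assume the 'kind|target|text' format above.
    kind, target, text = (raw.split("|") + [None, None, None])[:3]
    return Action(kind=kind, target=target, text=text)


def is_task_complete(state: EpisodeState, screenshot_png: bytes) -> bool:
    """Use the screen image (not only text) to decide whether the goal is met."""
    prompt = (
        f"Task: {state.task}\n"
        "Does the attached screen visually confirm the task is finished? "
        "Answer yes or no."
    )
    answer = query_multimodal_llm(prompt, screenshot_png)
    return answer.strip().lower().startswith("yes")


def generate_test_script(task: str, device, max_steps: int = 20) -> List[Action]:
    """Iterate: check completion first, otherwise act, until done or budget spent."""
    state = EpisodeState(task=task)
    for _ in range(max_steps):
        shot = device.screenshot()           # assumed device API returning PNG bytes
        if is_task_complete(state, shot):    # stop exactly at the final screen
            break
        action = propose_next_action(state, shot)
        device.execute(action)               # assumed device API
        state.history.append(action)
    return state.history                     # replayable as a test script
```

Checking completion from the screenshot before acting is what would let such a loop stop exactly at the final screen rather than running past it, mirroring the behavior the abstract attributes to the visual completeness check.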
Related papers
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment [58.94611347128066]
Task Preference Optimization (TPO) is a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models.
arXiv Detail & Related papers (2024-12-26T18:56:05Z) - VideoGUI: A Benchmark for GUI Automation from Instructional Videos [78.97292966276706]
VideoGUI is a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks.
Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software.
Our evaluation reveals that even the state-of-the-art large multimodal model GPT-4o performs poorly on visual-centric GUI tasks.
arXiv Detail & Related papers (2024-06-14T17:59:08Z) - TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection [23.73648235283315]
Task-oriented object detection aims to find objects suitable for accomplishing specific tasks.
Recent solutions are mainly all-in-one models.
We propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection.
arXiv Detail & Related papers (2024-03-12T22:33:02Z) - Android in the Zoo: Chain-of-Action-Thought for GUI Agents [38.07337874116759]
This work presents Chain-of-Action-Thought (dubbed CoAT), which takes as input the description of the previous actions, the current screen, and, more importantly, reasoning about which actions should be performed and the outcomes that the chosen action would lead to.
We demonstrate that, in a zero-shot setting upon three off-the-shelf LMMs, CoAT significantly improves the action prediction compared to previous proposed context modeling.
To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations.
arXiv Detail & Related papers (2024-03-05T07:09:35Z) - Multitask Multimodal Prompted Training for Interactive Embodied Task Completion [48.69347134411864]
Embodied MultiModal Agent (EMMA) is a unified encoder-decoder model that reasons over images and trajectories.
By unifying all tasks as text generation, EMMA learns a language of actions which facilitates transfer across tasks.
arXiv Detail & Related papers (2023-11-07T15:27:52Z) - ControlLLM: Augment Language Models with Tools by Searching on Graphs [97.62758830255002]
We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving real-world tasks.
Our framework comprises three key components: (1) a task decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches for the optimal solution path on a pre-built tool graph; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools.
arXiv Detail & Related papers (2023-10-26T21:57:21Z) - You Only Look at Screens: Multimodal Chain-of-Action Agents [37.118034745972956]
Auto-GUI is a multimodal solution that directly interacts with the interface.
We propose a chain-of-action technique to help the agent decide what action to execute.
We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions.
arXiv Detail & Related papers (2023-09-20T16:12:32Z) - TransPrompt v2: A Transferable Prompting Framework for Cross-task Text Classification [37.824031151922604]
We propose TransPrompt v2, a novel transferable prompting framework for few-shot learning across similar or distant text classification tasks.
For learning across similar tasks, we employ a multi-task meta-knowledge acquisition (MMA) procedure to train a meta-learner.
For learning across distant tasks, we inject the task type descriptions into the prompt, and capture the intra-type and inter-type prompt embeddings.
arXiv Detail & Related papers (2023-08-29T04:16:57Z) - SimpleMTOD: A Simple Language Model for Multimodal Task-Oriented Dialogue with Symbolic Scene Representation [2.4469484645516837]
SimpleMTOD recasts several sub-tasks in multimodal task-oriented dialogues as sequence prediction tasks.
We introduce both local and de-localized tokens for objects within a scene.
The model does not rely on task-specific architectural changes such as classification heads.
arXiv Detail & Related papers (2023-07-10T21:16:46Z) - Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose SoftCPT, a novel method to tune pre-trained vision-language models on multiple target few-shot tasks jointly.
We show that SoftCPT significantly outperforms single-task prompt tuning methods.
arXiv Detail & Related papers (2022-08-29T10:19:10Z) - Multimedia Generative Script Learning for Task Planning [58.73725388387305]
We propose a new task, Multimedia Generative Script Learning, to generate subsequent steps by tracking historical states in both text and vision modalities.
This task is challenging in three aspects: the multimedia challenge of capturing the visual states in images, the induction challenge of performing unseen tasks, and the diversity challenge of covering different information in individual steps.
Experiment results demonstrate that our approach significantly outperforms strong baselines.
arXiv Detail & Related papers (2022-08-25T19:04:28Z) - Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose UniVL, a unified multimodal pre-training approach for both vision-language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z) - VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding [78.28397557433544]
We present a task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks.
Experimental results show strong performance across a wider range of tasks than any previous methods, often outperforming task-specific pre-training.
arXiv Detail & Related papers (2021-05-20T19:13:27Z)