Toward Generation of Test Cases from Task Descriptions via History-aware Planning
- URL: http://arxiv.org/abs/2504.14336v1
- Date: Sat, 19 Apr 2025 16:03:03 GMT
- Title: Toward Generation of Test Cases from Task Descriptions via History-aware Planning
- Authors: Duy Cao, Phu Nguyen, Vy Le, Tien N. Nguyen, Vu Nguyen
- Abstract summary: In automated web testing, generating test scripts from natural language task descriptions is crucial for enhancing the test generation process. This activity involves creating the correct sequences of actions to form test scripts for future testing activities. We introduce HxAgent, an iterative large language model agent planning approach that determines the next action based on: 1) observations of the current contents and feasible actions, 2) short-term memory of previous web states and actions, and 3) long-term experience with (in)correct action sequences.
- Score: 8.467983784989805
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In automated web testing, generating test scripts from natural language task descriptions is crucial for enhancing the test generation process. This activity involves creating the correct sequences of actions to form test scripts for future testing activities. Current state-of-the-art approaches are limited in generating these action sequences, as they either demand substantial manual effort for human demonstrations or fail to consider the history of previous web content and actions to decide the next action. In this paper, we introduce HxAgent, an iterative large language model agent planning approach that determines the next action based on: 1) observations of the current contents and feasible actions, 2) short-term memory of previous web states and actions, and 3) long-term experience with (in)correct action sequences. The agent generates a sequence of actions to perform a given task, which is effectively an automated test case to verify the task. We conducted an extensive empirical evaluation of HxAgent using two datasets. On the MiniWoB++ dataset, our approach achieves 97% exact-match accuracy that is comparable to the best baselines while eliminating the need for human demonstrations required by those methods. For complex tasks requiring navigation through multiple actions and screens, HxAgent achieves an average 82% exact-match. On the second dataset, comprising 350 task instances across seven popular websites, including YouTube, LinkedIn, Facebook, and Google, HxAgent achieves high performance, with 87% of the action sequences exactly matching the ground truth and a prefix-match of 93%, outperforming the baseline by 59%.
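The iterative, history-aware planning loop the abstract describes can be sketched as follows. This is a hypothetical skeleton in the spirit of HxAgent, not the authors' implementation; the function names (`observe`, `choose_action`) and memory structures are assumptions for illustration.

```python
# Hypothetical sketch of a history-aware planning loop in the spirit of HxAgent.
# observe(), choose_action(), and the memory structures are illustrative names,
# not the authors' actual API.

def plan_actions(task, observe, choose_action, long_term_experience, max_steps=20):
    """Iteratively pick the next action from the current observation,
    short-term memory of prior states/actions, and long-term experience.
    Returns the generated action sequence (effectively a test case)."""
    short_term_memory = []           # (state, action) pairs from this episode
    actions = []                     # the generated action sequence
    for _ in range(max_steps):
        state, feasible = observe()  # current page contents + feasible actions
        action = choose_action(task, state, feasible,
                               short_term_memory, long_term_experience)
        if action is None:           # agent decides the task is complete
            break
        actions.append(action)
        short_term_memory.append((state, action))
    return actions
```

In the paper's setting, `choose_action` would be an LLM call conditioned on all three information sources; here it is left abstract so the loop structure stands on its own.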
Related papers
- Iterative Trajectory Exploration for Multimodal Agents [69.32855772335624]
We propose an online self-exploration method for multimodal agents, namely SPORT.
SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning.
Evaluations on the GTA and GAIA benchmarks show that the SPORT agent achieves improvements of 6.41% and 3.64%, respectively.
arXiv Detail & Related papers (2025-04-30T12:01:27Z) - Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark [72.46357004059661]
We propose Similar, a step-wise Multi-dimensional Generalist Reward Model. It offers fine-grained signals for agent training and can choose better actions for inference-time scaling. We introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation.
arXiv Detail & Related papers (2025-03-24T13:30:47Z) - HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model [39.169389255970806]
HiAgent is a framework that leverages subgoals as memory chunks to manage the working memory of Large Language Model (LLM)-based agents hierarchically.
Results show that HiAgent achieves a twofold increase in success rate and reduces the average number of steps required by 3.8.
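HiAgent's idea of using subgoals as memory chunks can be illustrated with a minimal sketch. The class and method names below are assumptions, not HiAgent's actual API: completed subgoals are summarized into chunks so the active context holds only chunk summaries plus the current subgoal's raw steps.

```python
# Illustrative sketch of subgoal-based working-memory management (names assumed,
# not HiAgent's actual API). Completed subgoals collapse into summary chunks,
# keeping the agent's context short on long-horizon tasks.

class WorkingMemory:
    def __init__(self):
        self.chunks = []    # summaries of completed subgoals
        self.current = []   # raw steps of the in-progress subgoal

    def add_step(self, step):
        self.current.append(step)

    def complete_subgoal(self, summary):
        # Replace the detailed step trace with a single summary chunk.
        self.chunks.append(summary)
        self.current = []

    def context(self):
        # What the agent actually sees: compact history + current detail.
        return self.chunks + self.current
```

The design choice is that old detail is traded for a summary once a subgoal finishes, which is what bounds the working-memory size as the horizon grows.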
arXiv Detail & Related papers (2024-08-18T17:59:49Z) - Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments.
Our approach is a form of best-first tree search that operates within the actual environment space.
It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
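A generic best-first search of the kind described above can be sketched as follows. This is a textbook skeleton under stated assumptions, not the paper's algorithm: `expand` and `value_fn` stand in for the environment's action model and the learned value function.

```python
# Generic best-first search sketch (heapq-based). expand() and value_fn() are
# placeholders for an environment's action model and a learned value function;
# they are not components from the paper.
import heapq
import itertools

def best_first_search(start, expand, value_fn, is_goal, budget=100):
    """Pop the highest-value frontier state first until a goal is found
    or the node budget is exhausted. Returns the goal state or None."""
    counter = itertools.count()  # tie-breaker so equal-value states compare
    frontier = [(-value_fn(start), next(counter), start)]
    for _ in range(budget):
        if not frontier:
            break
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for child in expand(state):
            heapq.heappush(frontier, (-value_fn(child), next(counter), child))
    return None
```

In the web setting the budget matters because each expansion is a real environment interaction, which is why the paper's search operates within the actual environment space rather than a simulator.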
arXiv Detail & Related papers (2024-07-01T17:07:55Z) - Android in the Zoo: Chain-of-Action-Thought for GUI Agents [38.07337874116759]
This work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of previous actions, the current screen, and, more importantly, reasoning about which actions should be performed and the outcomes that the chosen action leads to.
We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling.
To further facilitate the research in this line, we construct a dataset Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action
arXiv Detail & Related papers (2024-03-05T07:09:35Z) - MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation [96.71370747681078]
We introduce MLAgentBench, a suite of 13 tasks ranging from improving model performance on CIFAR-10 to recent research problems like BabyLM.
For each task, an agent can perform actions like reading/writing files, executing code, and inspecting outputs.
We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate.
arXiv Detail & Related papers (2023-10-05T04:06:12Z) - JOADAA: joint online action detection and action anticipation [2.7792814152937027]
Action anticipation involves forecasting future actions by connecting past events to future ones.
Online action detection is the task of predicting actions in a streaming manner.
By combining action anticipation and online action detection, our approach can cover the missing dependencies of future information.
arXiv Detail & Related papers (2023-09-12T11:17:25Z) - QUERT: Continual Pre-training of Language Model for Query Understanding in Travel Domain Search [15.026682829320261]
We propose QUERT, a continual pre-trained language model for query understanding in travel domain search.
QUERT is jointly trained on four pre-training tasks tailored to the characteristics of queries in travel domain search.
To check on the improvement of QUERT to online business, we deploy QUERT and perform A/B testing on Fliggy APP.
arXiv Detail & Related papers (2023-06-11T15:39:59Z) - Non-Sequential Graph Script Induction via Multimedia Grounding [129.83134296316493]
We train a script knowledge model capable of both generating explicit graph scripts for learnt tasks and predicting future steps given a partial step sequence.
Human evaluation shows our model outperforming the WikiHow linear baseline by 48.76% absolute gains in capturing sequential and non-sequential step relationships.
arXiv Detail & Related papers (2023-05-27T18:13:17Z) - Neural Task Success Classifiers for Robotic Manipulation from Few Real Demonstrations [1.7205106391379026]
This paper presents a novel classifier that learns to classify task completion only from a few demonstrations.
We compare different neural classifiers, e.g., fully connected, fully convolutional, sequence2sequence, and domain adaptation-based classifiers.
Our model, i.e. fully convolutional neural network with domain adaptation and timing features, achieves an average classification accuracy of 97.3% and 95.5% across tasks.
arXiv Detail & Related papers (2021-07-01T19:58:16Z) - Document-Level Event Argument Extraction by Conditional Generation [75.73327502536938]
Event extraction has long been treated as a sentence-level task in the IE community.
We propose a document-level neural event argument extraction model by formulating the task as conditional generation following event templates.
We also compile a new document-level event extraction benchmark dataset WikiEvents.
arXiv Detail & Related papers (2021-04-13T03:36:38Z) - Action Sequence Predictions of Vehicles in Urban Environments using Map and Social Context [152.0714518512966]
This work studies the problem of predicting the sequence of future actions for surround vehicles in real-world driving scenarios.
The first contribution is an automatic method to convert the trajectories recorded in real-world driving scenarios to action sequences with the help of HD maps.
The second contribution lies in applying the method to the well-known traffic agent tracking and prediction dataset Argoverse, resulting in 228,000 action sequences.
The third contribution is to propose a novel action sequence prediction method by integrating past positions and velocities of the traffic agents, map information, and social context into a single end-to-end trainable neural network.
arXiv Detail & Related papers (2020-04-29T14:59:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences.