InterEvo-TR: Interactive Evolutionary Test Generation With Readability Assessment
- URL: http://arxiv.org/abs/2401.07072v1
- Date: Sat, 13 Jan 2024 13:14:29 GMT
- Title: InterEvo-TR: Interactive Evolutionary Test Generation With Readability Assessment
- Authors: Pedro Delgado-Pérez, Aurora Ramírez, Kevin J. Valle-Gómez, Inmaculada Medina-Bulo, and José Raúl Romero
- Abstract summary: We propose incorporating interactive readability assessments made by a tester into EvoSuite.
Our approach, InterEvo-TR, interacts with the tester at different moments during the search.
Our results show that the strategy to select and present intermediate results is effective for the purpose of readability assessment.
- Score: 1.6874375111244329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated test case generation has proven to be useful to reduce the usually
high expenses of software testing. However, several studies have also noted the
skepticism of testers regarding the comprehension of generated test suites when
compared to manually designed ones. This fact suggests that involving testers
in the test generation process could be helpful to increase their acceptance of
automatically-produced test suites. In this paper, we propose incorporating
interactive readability assessments made by a tester into EvoSuite, a
widely-known evolutionary test generation tool. Our approach, InterEvo-TR,
interacts with the tester at different moments during the search and shows
different test cases covering the same coverage target for their subjective
evaluation. The design of such an interactive approach involves a schedule of
interaction, a method to diversify the selected targets, a plan to save and
handle the readability values, and some mechanisms to customize the level of
engagement in the revision, among other aspects. To analyze the potential and
practicability of our proposal, we conduct a controlled experiment in which 39
participants, including academics, professional developers, and student
collaborators, interact with InterEvo-TR. Our results show that the strategy to
select and present intermediate results is effective for the purpose of
readability assessment. Furthermore, the participants' actions and responses to
a questionnaire allowed us to analyze the aspects influencing test code
readability and the benefits and limitations of an interactive approach in the
context of test case generation, paving the way for future developments based
on interactivity.
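To make the interaction cycle described in the abstract more concrete, the sketch below shows one possible shape of such a loop in Java: at a scheduled pause in the search, several candidate tests covering the same target are shown to the tester, and the subjective scores are stored for later use. This is a minimal illustration only; the class and method names are hypothetical and do not reflect the actual EvoSuite or InterEvo-TR APIs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Hypothetical sketch of an interactive readability-assessment step.
// Names are illustrative; they are not EvoSuite or InterEvo-TR APIs.
public class InteractiveReadabilityLoop {

    /** A generated test case together with the coverage target it satisfies. */
    record CandidateTest(String coverageTarget, String testCode) {}

    /** Readability scores collected from the tester, keyed by test code. */
    private final Map<String, Integer> readabilityScores = new HashMap<>();

    private final Scanner input = new Scanner(System.in);

    /**
     * At a scheduled interaction point, present several candidate tests that
     * cover the same target and record the tester's subjective scores.
     */
    void interact(List<CandidateTest> candidates) {
        for (CandidateTest candidate : candidates) {
            System.out.println("Coverage target: " + candidate.coverageTarget());
            System.out.println(candidate.testCode());
            System.out.print("Readability score (1-5, 0 to skip): ");
            int score = input.nextInt();
            if (score >= 1 && score <= 5) {
                readabilityScores.put(candidate.testCode(), score);
            }
        }
    }

    /** Stored scores can later bias selection toward more readable variants. */
    int scoreOf(String testCode) {
        return readabilityScores.getOrDefault(testCode, 0);
    }
}
```

In the tool itself, the collected scores would feed back into the evolutionary search, and the interaction schedule and level of engagement would be configurable rather than blocking on console input as in this sketch.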
Related papers
- Gamifying Testing in IntelliJ: A Replicability Study [8.689182960457137]
Gamification is an emerging technique to enhance motivation and performance in traditionally unengaging tasks like software testing.
Previous studies have indicated that gamified systems have the potential to improve software testing processes by providing testers with achievements and feedback.
This paper aims to replicate and validate the effects of IntelliGame, a gamification plugin for IntelliJ IDEA designed to engage developers in writing and executing tests.
arXiv Detail & Related papers (2025-04-27T16:17:11Z)
- Requirements-Driven Automated Software Testing: A Systematic Review [13.67495800498868]
This study synthesizes the current state of REDAST research, highlights trends, and proposes future directions.
This systematic literature review (SLR) explores the landscape of REDAST by analyzing requirements input, transformation techniques, test outcomes, evaluation methods, and existing limitations.
arXiv Detail & Related papers (2025-02-25T23:13:09Z)
- Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions.
Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes.
We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating the performance of proprietary and open-weight models.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
- Adaptive Testing for LLM-Based Applications: A Diversity-based Approach [15.33985438101206]
We show that diversity-based testing techniques, such as Adaptive Random Testing (ART), can be effectively applied to the testing of prompt templates.
Our results, obtained using various implementations that explore several string-based distances, confirm that our approach enables the discovery of failures with reduced testing budgets.
arXiv Detail & Related papers (2025-01-23T08:53:12Z)
- NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom [0.0]
Using data from Cloze tests administered to students in Brazil, word embedding (WE) models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity.
A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
arXiv Detail & Related papers (2024-11-02T15:22:26Z)
- Which Combination of Test Metrics Can Predict Success of a Software Project? A Case Study in a Year-Long Project Course [1.553083901660282]
Testing plays an important role in securing the success of a software development project.
We investigate whether we can quantify the effects various types of testing have on functional suitability.
arXiv Detail & Related papers (2024-08-22T04:23:51Z)
- AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z)
- Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles [5.634825161148484]
We introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics.
The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing.
We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs.
arXiv Detail & Related papers (2023-11-14T10:16:05Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses drawn from a large set of real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
- Measuring Software Testability via Automatically Generated Test Cases [8.17364116624769]
We propose a new approach to pursuing testability measurements based on software metrics.
Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases.
arXiv Detail & Related papers (2023-07-30T09:48:51Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator [37.590563896382456]
We propose an interactive evaluation framework for Task-Oriented Dialogue (TOD) systems.
We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues.
Experimental results show that RL-based TOD systems trained by our proposed user simulator can achieve nearly 98% inform and success rates.
arXiv Detail & Related papers (2022-10-26T07:41:32Z)
- Uncertainty-Driven Action Quality Assessment [67.20617610820857]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores.
We generate the estimation of uncertainty for each prediction, which is employed to re-weight AQA regression loss.
Our proposed method achieves competitive results on three benchmarks: the Olympic-event datasets MTL-AQA and FineDiving, and the surgical-skill dataset JIGSAWS.
arXiv Detail & Related papers (2022-07-29T07:21:15Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Evaluating Interactive Summarization: an Expansion-Based Framework [97.0077722128397]
We develop an end-to-end evaluation framework for interactive summarization.
Our framework includes a procedure for collecting real user sessions and evaluation measures relying on standards.
All of our solutions are intended to be released publicly as a benchmark.
arXiv Detail & Related papers (2020-09-17T15:48:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.