InterEvo-TR: Interactive Evolutionary Test Generation With Readability
Assessment
- URL: http://arxiv.org/abs/2401.07072v1
- Date: Sat, 13 Jan 2024 13:14:29 GMT
- Title: InterEvo-TR: Interactive Evolutionary Test Generation With Readability
Assessment
- Authors: Pedro Delgado-P\'erez and Aurora Ram\'irez and Kevin J. Valle-G\'omez
and Inmaculada Medina-Bulo and Jos\'e Ra\'ul Romero
- Abstract summary: We propose incorporating interactive readability assessments made by a tester into EvoSuite.
Our approach, InterEvo-TR, interacts with the tester at different moments during the search.
Our results show that the strategy to select and present intermediate results is effective for the purpose of readability assessment.
- Score: 1.6874375111244329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated test case generation has proven to be useful to reduce the usually
high expenses of software testing. However, several studies have also noted the
skepticism of testers regarding the comprehension of generated test suites when
compared to manually designed ones. This fact suggests that involving testers
in the test generation process could be helpful to increase their acceptance of
automatically-produced test suites. In this paper, we propose incorporating
interactive readability assessments made by a tester into EvoSuite, a
widely-known evolutionary test generation tool. Our approach, InterEvo-TR,
interacts with the tester at different moments during the search and shows
different test cases covering the same coverage target for their subjective
evaluation. The design of such an interactive approach involves a schedule of
interaction, a method to diversify the selected targets, a plan to save and
handle the readability values, and some mechanisms to customize the level of
engagement in the revision, among other aspects. To analyze the potential and
practicability of our proposal, we conduct a controlled experiment in which 39
participants, including academics, professional developers, and student
collaborators, interact with InterEvo-TR. Our results show that the strategy to
select and present intermediate results is effective for the purpose of
readability assessment. Furthermore, the participants' actions and responses to
a questionnaire allowed us to analyze the aspects influencing test code
readability and the benefits and limitations of an interactive approach in the
context of test case generation, paving the way for future developments based
on interactivity.
Related papers
- AntEval: Evaluation of Social Interaction Competencies in LLM-Driven
Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z) - Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of
System-level Testing of Autonomous Vehicles [5.634825161148484]
We introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics.
The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing.
We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs.
arXiv Detail & Related papers (2023-11-14T10:16:05Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - Unveiling the Sentinels: Assessing AI Performance in Cybersecurity Peer
Review [4.081120388114928]
In the field of cybersecurity, the practice of double-blind peer review is the de-facto standard.
This paper touches on the holy grail of peer reviewing and aims to shed light on the performance of AI in reviewing for academic security conferences.
We investigate the predictability of reviewing outcomes by comparing the results obtained from human reviewers and machine-learning models.
arXiv Detail & Related papers (2023-09-11T13:51:40Z) - Measuring Software Testability via Automatically Generated Test Cases [8.17364116624769]
We propose a new approach to pursuing testability measurements based on software metrics.
Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases.
arXiv Detail & Related papers (2023-07-30T09:48:51Z) - Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with
User Simulator [37.590563896382456]
We propose an interactive evaluation framework for Task-Oriented Dialogue (TOD) systems.
We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues.
Experimental results show that RL-based TOD systems trained by our proposed user simulator can achieve nearly 98% inform and success rates.
arXiv Detail & Related papers (2022-10-26T07:41:32Z) - Uncertainty-Driven Action Quality Assessment [67.20617610820857]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores.
We generate the estimation of uncertainty for each prediction, which is employed to re-weight AQA regression loss.
Our proposed method achieves competitive results on three benchmarks including the Olympic events MTL-AQA and FineDiving, and the surgical skill JIGSAWS datasets.
arXiv Detail & Related papers (2022-07-29T07:21:15Z) - Co-Located Human-Human Interaction Analysis using Nonverbal Cues: A
Survey [71.43956423427397]
We aim to identify the nonverbal cues and computational methodologies resulting in effective performance.
This survey differs from its counterparts by involving the widest spectrum of social phenomena and interaction settings.
Some major observations are: the most often used nonverbal cue, computational method, interaction environment, and sensing approach are speaking activity, support vector machines, and meetings composed of 3-4 persons equipped with microphones and cameras, respectively.
arXiv Detail & Related papers (2022-07-20T13:37:57Z) - Discovering Boundary Values of Feature-based Machine Learning
Classifiers through Exploratory Datamorphic Testing [7.8729820663730035]
This paper proposes a set of testing strategies for testing machine learning applications in the framework of the datamorphism testing methodology.
Three variants of exploratory strategies are presented with the algorithms implemented in the automated datamorphic testing tool Morphy.
Their capability and cost of discovering borders between classes are evaluated via a set of controlled experiments with manually designed subjects and a set of case studies with real machine learning models.
arXiv Detail & Related papers (2021-10-01T11:47:56Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Evaluating Interactive Summarization: an Expansion-Based Framework [97.0077722128397]
We develop an end-to-end evaluation framework for interactive summarization.
Our framework includes a procedure of collecting real user sessions and evaluation measures relying on standards.
All of our solutions are intended to be released publicly as a benchmark.
arXiv Detail & Related papers (2020-09-17T15:48:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.