Related papers: FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair

FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair

URL: http://arxiv.org/abs/2307.00012v4
Date: Fri, 30 Aug 2024 22:49:32 GMT
Title: FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair
Authors: Sakina Fatima, Hadi Hemmati, Lionel Briand,
Abstract summary: Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test. In this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category.
Score: 0.5749787074942512
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Flaky tests are problematic because they non-deterministically pass or fail for the same software version under test, causing confusion and wasting development effort. While machine learning models have been used to predict flakiness and its root causes, there is much less work on providing support to fix the problem. To address this gap, in this paper, we focus on predicting the type of fix that is required to remove flakiness and then repair the test code on that basis. We do this for a subset of flaky tests where the root cause of flakiness is in the test itself and not in the production code. One key idea is to guide the repair process with additional knowledge about the test's flakiness in the form of its predicted fix category. Thus, we first propose a framework that automatically generates labeled datasets for 13 fix categories and trains models to predict the fix category of a flaky test by analyzing the test code only. Our experimental results using code models and few-shot learning show that we can correctly predict most of the fix categories. To show the usefulness of such fix category labels for automatically repairing flakiness, we augment the prompts of GPT-3.5 Turbo, a Large Language Model (LLM), with such extra knowledge to request repair suggestions. The results show that our suggested fix category labels, complemented with in-context learning, significantly enhance the capability of GPT-3.5 Turbo in generating fixes for flaky tests. Based on the execution and analysis of a sample of GPT-repaired flaky tests, we estimate that a large percentage of such repairs (roughly between 51% and 83%) can be expected to pass. For the failing repaired tests, on average, 16% of the test code needs to be further changed for them to pass.

Related papers

Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures [6.824747267214373]
Flaky tests produce inconsistent outcomes without code changes. Developers spend 1.28% of their time repairing flaky tests at a monthly cost of $2,250. We show that flaky tests often exist in clusters, with co-occurring failures that share the same root causes, which we call systemic flakiness.
arXiv Detail & Related papers (2025-04-23T14:51:23Z)
UTFix: Change Aware Unit Test Repairing using LLM [24.12850207529614]
We present UTFix, a novel approach for repairing unit tests when their corresponding focal methods undergo changes. Our approach leverages language models to repair unit tests by providing contextual information such as static code slices, dynamic code slices, and failure messages. To the best of our knowledge, this is the first comprehensive study focused on unit test in evolving Python projects.
arXiv Detail & Related papers (2025-03-19T06:10:03Z)
Learning to Generate Unit Tests for Automated Debugging [52.63217175637201]
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs) We propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs.
arXiv Detail & Related papers (2025-02-03T18:51:43Z)
Improving LLM-based Unit test generation via Template-based Repair [8.22619177301814]
Unit test is crucial for detecting bugs in individual program units but consumes time and effort. Large language models (LLMs) have demonstrated remarkable reasoning and generation capabilities. In this paper, we propose TestART, a novel unit test generation method.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
Fix the Tests: Augmenting LLMs to Repair Test Cases with Static Collector and Neural Reranker [9.428021853841296]
We propose SYNTER, a novel approach to automatically repair obsolete test cases via precise and concise TROCtxs construction. With the augmentation of constructed TROCtxs, hallucinations are reduced by 57.1%.
arXiv Detail & Related papers (2024-07-04T04:24:43Z)
A Generic Approach to Fix Test Flakiness in Real-World Projects [7.122378689356857]
FlakyDoctor is a neuro-symbolic technique that combines the power of LLMs-generalizability-and program analysis-soundness to fix different types of test flakiness. Comparing to three alternative flakiness repair approaches, FlakyDoctor can repair 8% more ID tests than DexFix, 12% more OD flaky tests than OD, and 17% more OD flaky tests than iFixFlakies.
arXiv Detail & Related papers (2024-04-15T01:07:57Z)
FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests [3.0846824529023382]
Flaky tests can pass or fail non-deterministically, without alterations to a software system. State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy.
arXiv Detail & Related papers (2024-03-01T22:00:44Z)
GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data. We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch. Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address this challenge by adapting a model to unlabeled data at test time. We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
Break-It-Fix-It: Unsupervised Learning for Program Repair [90.55497679266442]
We propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas. We use the critic to check a fixer's output on real bad inputs and add good (fixed) outputs to the training data. Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data. BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python and 71.7% on DeepFix.
arXiv Detail & Related papers (2021-06-11T20:31:04Z)
What is the Vocabulary of Flaky Tests? An Extended Replication [0.0]
We conduct an empirical study to assess the use of code identifiers to predict test flakiness. We validated the performance of trained models using datasets with other flaky tests and from different projects.
arXiv Detail & Related papers (2021-03-23T16:42:22Z)
Coping with Label Shift via Distributionally Robust Optimisation [72.80971421083937]
We propose a model that minimises an objective based on distributionally robust optimisation (DRO) We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective.
arXiv Detail & Related papers (2020-10-23T08:33:04Z)
TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE) In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement? We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually. Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.