An Analysis of LLM Fine-Tuning and Few-Shot Learning for Flaky Test Detection and Classification
- URL: http://arxiv.org/abs/2502.02715v1
- Date: Tue, 04 Feb 2025 20:54:51 GMT
- Title: An Analysis of LLM Fine-Tuning and Few-Shot Learning for Flaky Test Detection and Classification
- Authors: Riddhi More, Jeremy S. Bradbury
- Abstract summary: Flaky tests exhibit non-deterministic behavior during execution.
Flaky test detection and classification is challenging due to the variability in test behavior.
- Score: 1.9336815376402723
- Abstract: Flaky tests exhibit non-deterministic behavior during execution and may pass or fail without any changes to the program under test. Detecting and classifying these flaky tests is crucial for maintaining the robustness of automated test suites and ensuring overall reliability and confidence in testing. However, flaky test detection and classification is challenging due to the variability in test behavior, which can depend on environmental conditions and subtle code interactions. Large Language Models (LLMs) offer promising approaches to address this challenge, with fine-tuning and few-shot learning (FSL) emerging as viable techniques. With enough data, fine-tuning a pre-trained LLM can achieve high accuracy, making it suitable for organizations with greater resources. Alternatively, we introduce FlakyXbert, an FSL approach that employs a Siamese network architecture to train efficiently with limited data. To understand the performance and cost differences between these two methods, we compare fine-tuning on larger datasets with FSL in scenarios restricted to smaller datasets. Our evaluation involves two existing flaky test datasets, FlakyCat and IDoFT. Our results suggest that while fine-tuning can achieve high accuracy, FSL provides a cost-effective approach with competitive accuracy, which is especially beneficial for organizations or projects with limited historical data available for training. These findings underscore the viability of both fine-tuning and FSL in flaky test detection and classification, with each suited to different organizational needs and resource availability.
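To make the Siamese-network FSL idea concrete, below is a minimal, hypothetical sketch of few-shot flaky-test classification with twin weight-sharing encoders and a contrastive loss. It is not the authors' FlakyXbert implementation: the `SiameseEncoder`, `contrastive_loss`, token-id inputs, and hyperparameters are placeholders (a CodeBERT-style pre-trained encoder would typically replace the toy embedding layer).

```python
# Hypothetical sketch of Siamese-network few-shot learning for flaky-test
# classification. Placeholder encoder and data; not the paper's FlakyXbert code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Maps a tokenized test case to a normalized embedding; both branches share these weights."""
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # stand-in for a pre-trained code encoder
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        return F.normalize(self.proj(self.embed(token_ids)), dim=-1)

def contrastive_loss(z1, z2, same_label, margin=0.5):
    """Pull together embeddings of tests sharing a flakiness category, push apart the rest."""
    dist = 1.0 - F.cosine_similarity(z1, z2)
    return (same_label * dist.pow(2) +
            (1 - same_label) * F.relu(margin - dist).pow(2)).mean()

encoder = SiameseEncoder()
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# One toy training step on a batch of test-case pairs (random token ids as placeholders).
a = torch.randint(0, 30522, (8, 64))        # "anchor" test cases
b = torch.randint(0, 30522, (8, 64))        # paired test cases
same = torch.randint(0, 2, (8,)).float()    # 1 if a pair shares the same flaky-test category

loss = contrastive_loss(encoder(a), encoder(b), same)
loss.backward()
opt.step()

# Few-shot inference: score a new test against the mean embedding (prototype) of a category's
# few labeled support examples; the nearest prototype gives the predicted category.
support = encoder(torch.randint(0, 30522, (4, 64)))
query = encoder(torch.randint(0, 30522, (1, 64)))
score = F.cosine_similarity(query, support.mean(0, keepdim=True))
```

The design point this illustrates is why FSL suits small datasets: the network only needs enough labeled pairs to shape an embedding space, after which a handful of support examples per category is sufficient for classification, rather than the large labeled corpus that full fine-tuning typically requires.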
Related papers
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z) - Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - FlaKat: A Machine Learning-Based Categorization Framework for Flaky
Tests [3.0846824529023382]
Flaky tests can pass or fail non-deterministically, without alterations to a software system.
State-of-the-art research incorporates machine learning solutions into flaky test detection and achieves reasonably good accuracy.
arXiv Detail & Related papers (2024-03-01T22:00:44Z) - An Adaptive Plug-and-Play Network for Few-Shot Learning [12.023266104119289]
Few-shot learning requires a model to classify new samples after learning from only a few samples.
Deep networks and complex metrics tend to induce overfitting, making it difficult to further improve the performance.
We propose plug-and-play model-adaptive resizer (MAR) and adaptive similarity metric (ASM) without any other losses.
arXiv Detail & Related papers (2023-02-18T13:25:04Z) - Benchmark for Uncertainty & Robustness in Self-Supervised Learning [0.0]
Self-Supervised Learning is crucial for real-world applications, especially in data-hungry domains such as healthcare and self-driving cars.
In this paper, we explore variants of SSL methods, including Jigsaw Puzzles, Context, Rotation, Geometric Transformations Prediction for vision, as well as BERT and GPT for language tasks.
Our goal is to create a benchmark with outputs from experiments, providing a starting point for new SSL methods in Reliable Machine Learning.
arXiv Detail & Related papers (2022-12-23T15:46:23Z) - CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address this challenge by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z) - Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data would be prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z) - Hybrid Consistency Training with Prototype Adaptation for Few-Shot
Learning [11.873143649261362]
Few-Shot Learning aims to improve a model's generalization capability in low data regimes.
Recent FSL works have made steady progress via metric learning, meta learning, representation learning, etc.
arXiv Detail & Related papers (2020-11-19T19:51:33Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)