On the Costs and Benefits of Adopting Lifelong Learning for Software
Analytics -- Empirical Study on Brown Build and Risk Prediction
- URL: http://arxiv.org/abs/2305.09824v2
- Date: Mon, 12 Feb 2024 17:43:07 GMT
- Title: On the Costs and Benefits of Adopting Lifelong Learning for Software
Analytics -- Empirical Study on Brown Build and Risk Prediction
- Authors: Doriane Olewicki, Sarra Habchi, Mathieu Nayrolles, Mojtaba Faramarzi,
Sarath Chandar, Bram Adams
- Abstract summary: This paper evaluates the use of lifelong learning (LL) for industrial use cases at Ubisoft.
LL is used to continuously build and maintain ML-based software analytics tools using an incremental learner that progressively updates the old model using new data.
- Score: 17.502553991799832
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Nowadays, software analytics tools using machine learning (ML) models to, for
example, predict the risk of a code change are well established. However, as
the goals of a project shift over time, and developers and their habits change,
the performance of said models tends to degrade (drift) over time. Current
retraining practices typically require retraining a new model from scratch on a
large updated dataset when performance decay is observed, thus incurring a
computational cost; moreover, there is no continuity between models, as the past
model is discarded and ignored when training the new one. Even though the
literature has taken interest in online learning approaches, those have rarely
been integrated and evaluated in industrial environments. This paper evaluates
the use of lifelong learning (LL) for industrial use cases at Ubisoft,
evaluating both the performance and the required computational effort in
comparison to the retraining-from-scratch approaches commonly used by the
industry. LL is used to continuously build and maintain ML-based software
analytics tools using an incremental learner that progressively updates the old
model using new data. To avoid so-called "catastrophic forgetting" of important
older data points, we adopt a replay buffer of older data, which still allows
us to drastically reduce the size of the overall training dataset, and hence
model training time.
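The replay-buffer idea described in the abstract can be sketched as follows. This is an illustrative outline in plain Python, not the paper's implementation: the incremental logistic learner, the learning rate, and reservoir sampling for the buffer are our own assumptions.

```python
import math
import random

# Sketch of lifelong learning with a replay buffer: each incremental update
# trains on the new batch plus a small reservoir sample of older data, so the
# model keeps past patterns without retraining on the full historical dataset.
class ReplayBufferLearner:
    def __init__(self, n_features, buffer_size=100, lr=0.1, seed=0):
        self.w = [0.0] * n_features   # logistic model weights
        self.b = 0.0
        self.buffer = []              # reservoir of past (x, y) pairs
        self.buffer_size = buffer_size
        self.lr = lr
        self.seen = 0                 # total examples observed so far
        self.rng = random.Random(seed)

    def _sgd_step(self, x, y):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y                     # gradient of log loss w.r.t. z
        self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * g

    def update(self, batch):
        # train on replayed old examples plus the new batch
        for x, y in self.buffer + list(batch):
            self._sgd_step(x, y)
        # reservoir sampling keeps a uniform sample of everything seen
        for pair in batch:
            self.seen += 1
            if len(self.buffer) < self.buffer_size:
                self.buffer.append(pair)
            else:
                j = self.rng.randrange(self.seen)
                if j < self.buffer_size:
                    self.buffer[j] = pair

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if z > 0 else 0
```

Because the buffer is bounded, each update touches only the new batch plus a fixed-size sample of history, which is the source of the training-time reduction the abstract describes.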
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Temporal Knowledge Distillation for Time-Sensitive Financial Services Applications [7.1795069620810805]
Anomaly detection is frequently used in key compliance and risk functions such as financial crime detection, fraud, and cybersecurity.
Keeping up with the rapid changes by retraining the models with the latest data patterns introduces pressures in balancing the historical and current patterns.
The proposed approach provides advantages in retraining times while improving the model performance.
arXiv Detail & Related papers (2023-12-28T03:04:30Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence to demonstrate that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
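The core gradient-projection idea can be illustrated in miniature. This is a simplified sketch of our own, not the PGU implementation: the unlearning step direction is projected so its component along a retained-data gradient is removed, leaving retained knowledge undisturbed.

```python
# Project the "forget" gradient so it is orthogonal to a retained-data
# gradient direction: g_forget - (g_forget . g_retain / |g_retain|^2) g_retain.
def project_out(g_forget, g_retain):
    """Remove from g_forget its component along g_retain."""
    dot = sum(a * b for a, b in zip(g_forget, g_retain))
    norm_sq = sum(b * b for b in g_retain)
    if norm_sq == 0.0:
        return list(g_forget)   # nothing to project against
    scale = dot / norm_sq
    return [a - scale * b for a, b in zip(g_forget, g_retain)]
```

An unlearning step taken along the projected direction is then (to first order) neutral with respect to the loss on the retained data.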
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves, for example, the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- PILOT: A Pre-Trained Model-Based Continual Learning Toolbox [71.63186089279218]
This paper introduces a pre-trained model-based continual learning toolbox known as PILOT.
On the one hand, PILOT implements some state-of-the-art class-incremental learning algorithms based on pre-trained models, such as L2P, DualPrompt, and CODA-Prompt.
On the other hand, PILOT fits typical class-incremental learning algorithms within the context of pre-trained models to evaluate their effectiveness.
arXiv Detail & Related papers (2023-09-13T17:55:11Z)
- Mitigating ML Model Decay in Continuous Integration with Data Drift Detection: An Empirical Study [7.394099294390271]
This study aims to investigate the performance of using data drift detection techniques for automatically detecting the retraining points for ML models for TCP in CI environments.
We employed the Hellinger distance to identify changes in both the values and distribution of input data and leveraged these changes as retraining points for the ML model.
Our experimental evaluation of the Hellinger distance-based method demonstrated its efficacy and efficiency in detecting retraining points and reducing the associated costs.
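The Hellinger distance used as the drift signal above is straightforward to compute for discrete distributions. The following is a minimal sketch; the histograms and the 0.1 threshold are illustrative assumptions, not values from the study.

```python
import math

# Hellinger distance between two discrete distributions p and q
# (each a list of probabilities summing to 1); ranges from 0 to 1.
def hellinger(p, q):
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

# Drift check: retrain when the histogram of recent input data has moved
# too far from the histogram of the data the model was trained on.
def needs_retraining(train_hist, recent_hist, threshold=0.1):
    return hellinger(train_hist, recent_hist) > threshold
```

Identical distributions give a distance of 0, fully disjoint ones give 1, so the threshold trades off retraining frequency against tolerance for drift.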
arXiv Detail & Related papers (2023-05-22T05:55:23Z)
- Robustness-preserving Lifelong Learning via Dataset Condensation [11.83450966328136]
'Catastrophic forgetting' refers to the notorious dilemma between improving model accuracy on new data and retaining accuracy on previous data.
We propose a new memory-replay LL strategy that leverages modern bi-level optimization techniques to determine the 'coreset' of the current data.
We term the resulting LL framework 'Data-Efficient Robustness-Preserving LL' (DERPLL)
Experimental results show that DERPLL outperforms the conventional coreset-guided LL baseline.
arXiv Detail & Related papers (2023-03-07T19:09:03Z)
- Effective and Efficient Training for Sequential Recommendation using Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method can achieve performance exceeding or very close to the state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z)
- Model Reprogramming: Resource-Efficient Cross-Domain Machine Learning [65.268245109828]
In data-rich domains such as vision, language, and speech, deep learning prevails to deliver high-performance task-specific models.
Deep learning in resource-limited domains still faces multiple challenges including (i) limited data, (ii) constrained model development cost, and (iii) lack of adequate pre-trained models for effective finetuning.
Model reprogramming enables resource-efficient cross-domain machine learning by repurposing a well-developed pre-trained model from a source domain to solve tasks in a target domain without model finetuning.
arXiv Detail & Related papers (2022-02-22T02:33:54Z)
- Passive learning to address nonstationarity in virtual flow metering applications [0.0]
This paper explores how learning methods can be applied to sustain the prediction accuracy of steady-state virtual flow meters.
Two passive learning methods, periodic batch learning and online learning, are applied with varying calibration frequency to train virtual flow meters.
The results are two-fold: first, in the presence of frequently arriving measurements, frequent model updating sustains an excellent prediction performance over time; second, in the presence of intermittent and infrequently arriving measurements, frequent updating is essential to increase the performance accuracy.
arXiv Detail & Related papers (2022-02-07T14:42:00Z)
- Lambda Learner: Fast Incremental Learning on Data Streams [5.543723668681475]
We propose a new framework for training models by incremental updates in response to mini-batches from data streams.
We show that the resulting model of our framework closely estimates a periodically updated model trained on offline data and outperforms it when model updates are time-sensitive.
We present a large-scale deployment on the sponsored content platform for a large social network.
arXiv Detail & Related papers (2020-10-11T04:00:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.