Active Code Learning: Benchmarking Sample-Efficient Training of Code
Models
- URL: http://arxiv.org/abs/2306.01250v1
- Date: Fri, 2 Jun 2023 03:26:11 GMT
- Title: Active Code Learning: Benchmarking Sample-Efficient Training of Code
Models
- Authors: Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike
Papadakis, and Yves Le Traon
- Abstract summary: In software engineering (ML4Code), efficiently training models of code with less human effort has become an emerging problem.
Active learning is a technique that allows developers to train a model with less data while still achieving the desired performance.
This paper builds the first benchmark to study this critical problem - active code learning.
- Score: 35.54965391159943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The costly human effort required to prepare the training data of machine
learning (ML) models hinders their practical development and usage in software
engineering (ML4Code), especially for those with limited budgets. Therefore,
efficiently training models of code with less human effort has become an
emerging problem. Active learning addresses this issue by allowing developers
to train a model with less data while still achieving the desired performance,
and it has been well studied in the computer vision and natural language
processing domains. Unfortunately, no existing work explores the effectiveness
of active learning for code models. In this
paper, we bridge this gap by building the first benchmark to study this
critical problem - active code learning. Specifically, we collect 11
acquisition functions~(which are used for data selection in active learning)
from existing works and adapt them for code-related tasks. Then, we conduct an
empirical study to check whether these acquisition functions maintain
performance on code data. The results demonstrate that the choice of features
strongly affects active learning, and that using output vectors to select data
is the best choice. For the code summarization task, active code learning is
ineffective and produces models with more than a 29.64% gap from the
expected performance. Furthermore, we explore future directions of active code
learning with an exploratory study. We propose to replace distance calculation
methods with evaluation metrics and find a correlation between these
evaluation-based distance methods and the performance of code models.
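As a concrete illustration of output-vector-based data selection, the sketch below shows one common acquisition function (predictive-entropy sampling) applied to a model's softmax outputs; the names and the function itself are illustrative assumptions, not the paper's benchmark code.

```python
# Minimal sketch (assumed names, not the paper's benchmark code): an entropy-based
# acquisition function that selects unlabeled examples using the model's output vectors.
import numpy as np

def entropy_acquisition(probs: np.ndarray, budget: int) -> np.ndarray:
    """probs: [n_unlabeled, n_classes] softmax outputs; returns indices of examples to label."""
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)  # per-example predictive entropy
    return np.argsort(-entropy)[:budget]                    # most uncertain examples first

# Usage (hypothetical model): picked = entropy_acquisition(model.predict_proba(pool), budget=100)
```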
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the influence of a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in more challenging settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- regAL: Python Package for Active Learning of Regression Problems [0.0]
We present our Python package regAL, which allows users to evaluate different active learning strategies for regression problems.
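For intuition, here is a minimal pool-based active learning loop for regression using uncertainty sampling with scikit-learn's GaussianProcessRegressor; it is a generic sketch of the kind of strategy such a package evaluates, not the regAL API.

```python
# Generic uncertainty-sampling loop for regression (illustrative sketch, NOT the regAL API).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X_pool = rng.uniform(-3, 3, size=(200, 1))                      # toy unlabeled pool
y_pool = np.sin(X_pool[:, 0]) + 0.1 * rng.normal(size=200)      # toy oracle labels

labeled = list(rng.choice(len(X_pool), size=5, replace=False))  # small initial labeled set
for _ in range(10):                                             # ten acquisition rounds
    model = GaussianProcessRegressor().fit(X_pool[labeled], y_pool[labeled])
    _, std = model.predict(X_pool, return_std=True)             # predictive uncertainty per point
    std[labeled] = -np.inf                                      # never re-select labeled points
    labeled.append(int(np.argmax(std)))                         # query the most uncertain point
```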
arXiv Detail & Related papers (2024-10-23T14:34:36Z)
- EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983]
We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models.
We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness.
Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency.
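A rough sketch of how compact model embeddings can drive routing is shown below; the embeddings and inner-product scorer are toy placeholders, not the training procedure or evaluation framework described in the paper.

```python
# Toy sketch of embedding-based model routing (placeholder vectors, not EmbedLLM itself).
import numpy as np

rng = np.random.default_rng(0)
model_embeddings = rng.normal(size=(8, 32))   # 8 candidate LLMs, each as a 32-dim learned vector
query_embedding = rng.normal(size=32)         # representation of the incoming prompt

scores = model_embeddings @ query_embedding   # compatibility score of each model for this query
chosen_model = int(np.argmax(scores))         # route the query to the best-scoring model
```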
arXiv Detail & Related papers (2024-10-03T05:43:24Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders with a scheme that leverages both the randomness of masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
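The contrastive stage can be pictured with a standard InfoNCE-style objective over paired code embeddings, as in the hedged sketch below; the paper's hard-negative and hard-positive construction is not reproduced here.

```python
# Illustrative InfoNCE-style contrastive loss over code embeddings (generic sketch only;
# the paper's hard-negative/hard-positive construction is not reproduced).
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: [batch, dim] embeddings of two views of the same code snippet."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # off-diagonal entries act as in-batch negatives
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```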
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- Benchmarking Learning Efficiency in Deep Reservoir Computing [23.753943709362794]
We introduce a benchmark of increasingly difficult tasks together with a data efficiency metric to measure how quickly machine learning models learn from training data.
We compare the learning speed of established sequential supervised models, such as RNNs, LSTMs, and Transformers, with lesser-known alternative models based on reservoir computing.
arXiv Detail & Related papers (2022-09-29T08:16:52Z)
- ALBench: A Framework for Evaluating Active Learning in Object Detection [102.81795062493536]
This paper contributes an active learning benchmark framework named ALBench for evaluating active learning in object detection.
Developed on an automatic deep model training system, the ALBench framework is easy to use, compatible with different active learning algorithms, and ensures the same training and testing protocols.
arXiv Detail & Related papers (2022-07-27T07:46:23Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- Learnability of Learning Performance and Its Application to Data Valuation [11.78594243870616]
In most machine learning (ML) tasks, evaluating learning performance on a given dataset requires intensive computation.
The ability to efficiently estimate learning performance may benefit a wide spectrum of applications, such as active learning, data quality management, and data valuation.
Recent empirical studies show that for many common ML models, one can accurately learn a parametric model that predicts learning performance for any given input dataset using a small number of samples.
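One simple way to picture such a parametric predictor is fitting a power-law learning curve to performance measured at a few small sample sizes and extrapolating; the functional form and numbers below are illustrative assumptions, not the model used in the paper.

```python
# Hedged sketch: fit a power-law learning curve to a few measured points and extrapolate.
# The functional form, sample sizes, and error values are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * np.power(n, -b) + c                # test error often decays roughly as a power law in dataset size

sizes = np.array([100.0, 200.0, 400.0, 800.0])    # training-set sizes already evaluated (toy)
errors = np.array([0.40, 0.33, 0.28, 0.25])       # measured test error at each size (toy)

params, _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.1), maxfev=10000)
predicted_error = power_law(5000.0, *params)      # estimated error at 5k samples without training on them
```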
arXiv Detail & Related papers (2021-07-13T18:56:04Z)
- Probabilistic Active Meta-Learning [15.432006404678981]
We introduce task selection based on prior experience into a meta-learning algorithm.
We provide empirical evidence that our approach improves data-efficiency when compared to strong baselines on simulated robotic experiments.
arXiv Detail & Related papers (2020-07-17T12:51:42Z)
- Bayesian active learning for production, a systematic study and a reusable library [85.32971950095742]
In this paper, we analyse the main drawbacks of current active learning techniques.
We do a systematic study on the effects of the most common issues of real-world datasets on the deep active learning process.
We derive two techniques that can speed up the active learning loop: partial uncertainty sampling and a larger query size.
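Read literally, the two speed-ups can be sketched as scoring only a random subset of the unlabeled pool and labeling a large batch per round; the helper below is an assumption-laden illustration, not code from the paper's library.

```python
# Hedged sketch of the two speed-ups, assuming "partial uncertainty sampling" means scoring
# only a random subset of the pool and "larger query size" means labeling many items per round.
import numpy as np

def select_batch(score_fn, pool_size: int, subset_size: int, query_size: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    subset = rng.choice(pool_size, size=min(subset_size, pool_size), replace=False)
    scores = score_fn(subset)                    # uncertainty computed only on the sampled subset
    top = np.argsort(-scores)[:query_size]       # take a large batch in a single round
    return subset[top]

# Usage (hypothetical uncertainty array): select_batch(lambda idx: uncertainties[idx],
#                                                      pool_size=50_000, subset_size=5_000, query_size=500)
```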
arXiv Detail & Related papers (2020-06-17T14:51:11Z)