Active Code Learning: Benchmarking Sample-Efficient Training of Code
Models
- URL: http://arxiv.org/abs/2306.01250v1
- Date: Fri, 2 Jun 2023 03:26:11 GMT
- Title: Active Code Learning: Benchmarking Sample-Efficient Training of Code
Models
- Authors: Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike
Papadakis, and Yves Le Traon
- Abstract summary: In software engineering (ML4Code), efficiently training models of code with less human effort has become an emerging problem.
Active learning is a technique that allows developers to train a model with less data while still achieving the desired performance.
This paper builds the first benchmark to study this critical problem - active code learning.
- Score: 35.54965391159943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The costly human effort required to prepare the training data of machine
learning (ML) models hinders their practical development and usage in software
engineering (ML4Code), especially for those with limited budgets. Therefore,
efficiently training models of code with less human effort has become an
emerging problem. Active learning addresses this issue by allowing developers
to train a model with less data while still achieving the desired performance,
and it has been well studied in the computer vision and natural language
processing domains. Unfortunately, no existing work explores the effectiveness
of active learning for code models. In this
paper, we bridge this gap by building the first benchmark to study this
critical problem - active code learning. Specifically, we collect 11
acquisition functions~(which are used for data selection in active learning)
from existing works and adapt them for code-related tasks. Then, we conduct an
empirical study to check whether these acquisition functions maintain
performance on code data. The results demonstrate that the choice of features
strongly affects active learning, and that using output vectors to select data
is the best choice. For the code summarization task, active code learning is
ineffective and produces models with more than a 29.64% gap from the
expected performance. Furthermore, we explore future directions of active code
learning with an exploratory study. We propose to replace distance calculation
methods with evaluation metrics and find a correlation between these
evaluation-based distance methods and the performance of code models.
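As a concrete illustration of output-vector-based data selection, the sketch below shows one common acquisition function (predictive-entropy sampling) applied to a model's softmax outputs; the names and the function itself are illustrative assumptions, not the paper's benchmark code.

```python
# Minimal sketch (assumed names, not the paper's benchmark code): an entropy-based
# acquisition function that selects unlabeled examples using the model's output vectors.
import numpy as np

def entropy_acquisition(probs: np.ndarray, budget: int) -> np.ndarray:
    """probs: [n_unlabeled, n_classes] softmax outputs; returns indices of examples to label."""
    eps = 1e-12
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)  # per-example predictive entropy
    return np.argsort(-entropy)[:budget]                    # most uncertain examples first

# Usage (hypothetical model): picked = entropy_acquisition(model.predict_proba(pool), budget=100)
```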
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the influence of a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in more challenging settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- regAL: Python Package for Active Learning of Regression Problems [0.0]
We present our Python package regAL, which allows users to evaluate different active learning strategies for regression problems.
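For intuition, here is a minimal pool-based active learning loop for regression using uncertainty sampling with scikit-learn's GaussianProcessRegressor; it is a generic sketch of the kind of strategy such a package evaluates, not the regAL API.

```python
# Generic uncertainty-sampling loop for regression (illustrative sketch, NOT the regAL API).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
X_pool = rng.uniform(-3, 3, size=(200, 1))                      # toy unlabeled pool
y_pool = np.sin(X_pool[:, 0]) + 0.1 * rng.normal(size=200)      # toy oracle labels

labeled = list(rng.choice(len(X_pool), size=5, replace=False))  # small initial labeled set
for _ in range(10):                                             # ten acquisition rounds
    model = GaussianProcessRegressor().fit(X_pool[labeled], y_pool[labeled])
    _, std = model.predict(X_pool, return_std=True)             # predictive uncertainty per point
    std[labeled] = -np.inf                                      # never re-select labeled points
    labeled.append(int(np.argmax(std)))                         # query the most uncertain point
```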
arXiv Detail & Related papers (2024-10-23T14:34:36Z)
- EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983]
We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models.
We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness.
Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency.
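A rough sketch of how compact model embeddings can drive routing is shown below; the embeddings and inner-product scorer are toy placeholders, not the training procedure or evaluation framework described in the paper.

```python
# Toy sketch of embedding-based model routing (placeholder vectors, not EmbedLLM itself).
import numpy as np

rng = np.random.default_rng(0)
model_embeddings = rng.normal(size=(8, 32))   # 8 candidate LLMs, each as a 32-dim learned vector
query_embedding = rng.normal(size=32)         # representation of the incoming prompt

scores = model_embeddings @ query_embedding   # compatibility score of each model for this query
chosen_model = int(np.argmax(scores))         # route the query to the best-scoring model
```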
arXiv Detail & Related papers (2024-10-03T05:43:24Z)
- Code Representation Learning At Scale [75.04686476303436]
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders with a scheme that leverages both the randomness of masked language modeling and the structural aspects of programming languages.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
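The contrastive stage can be pictured with a standard InfoNCE-style objective over paired code embeddings, as in the hedged sketch below; the paper's hard-negative and hard-positive construction is not reproduced here.

```python
# Illustrative InfoNCE-style contrastive loss over code embeddings (generic sketch only;
# the paper's hard-negative/hard-positive construction is not reproduced).
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: [batch, dim] embeddings of two views of the same code snippet."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # off-diagonal entries act as in-batch negatives
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```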
arXiv Detail & Related papers (2024-02-02T22:19:15Z)
- Benchmarking Learning Efficiency in Deep Reservoir Computing [23.753943709362794]
We introduce a benchmark of increasingly difficult tasks together with a data efficiency metric to measure how quickly machine learning models learn from training data.
We compare the learning speed of established sequential supervised models, such as RNNs, LSTMs, and Transformers, with lesser-known alternative models based on reservoir computing.
arXiv Detail & Related papers (2022-09-29T08:16:52Z)
- ALBench: A Framework for Evaluating Active Learning in Object Detection [102.81795062493536]
This paper contributes an active learning benchmark framework named ALBench for evaluating active learning in object detection.
Developed on an automatic deep model training system, the ALBench framework is easy to use, compatible with different active learning algorithms, and ensures the same training and testing protocols.
arXiv Detail & Related papers (2022-07-27T07:46:23Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- Learnability of Learning Performance and Its Application to Data Valuation [11.78594243870616]
In most machine learning (ML) tasks, evaluating learning performance on a given dataset requires intensive computation.
The ability to efficiently estimate learning performance may benefit a wide spectrum of applications, such as active learning, data quality management, and data valuation.
Recent empirical studies show that for many common ML models, one can accurately learn a parametric model that predicts learning performance for any given input dataset using a small number of samples.
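One simple way to picture such a parametric predictor is fitting a power-law learning curve to performance measured at a few small sample sizes and extrapolating; the functional form and numbers below are illustrative assumptions, not the model used in the paper.

```python
# Hedged sketch: fit a power-law learning curve to a few measured points and extrapolate.
# The functional form, sample sizes, and error values are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * np.power(n, -b) + c                # test error often decays roughly as a power law in dataset size

sizes = np.array([100.0, 200.0, 400.0, 800.0])    # training-set sizes already evaluated (toy)
errors = np.array([0.40, 0.33, 0.28, 0.25])       # measured test error at each size (toy)

params, _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.1), maxfev=10000)
predicted_error = power_law(5000.0, *params)      # estimated error at 5k samples without training on them
```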
arXiv Detail & Related papers (2021-07-13T18:56:04Z)
- Probabilistic Active Meta-Learning [15.432006404678981]
We introduce task selection based on prior experience into a meta-learning algorithm.
We provide empirical evidence that our approach improves data-efficiency when compared to strong baselines on simulated robotic experiments.
arXiv Detail & Related papers (2020-07-17T12:51:42Z)
- Bayesian active learning for production, a systematic study and a reusable library [85.32971950095742]
In this paper, we analyse the main drawbacks of current active learning techniques.
We do a systematic study on the effects of the most common issues of real-world datasets on the deep active learning process.
We derive two techniques that can speed up the active learning loop: partial uncertainty sampling and a larger query size.
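Read literally, the two speed-ups can be sketched as scoring only a random subset of the unlabeled pool and labeling a large batch per round; the helper below is an assumption-laden illustration, not code from the paper's library.

```python
# Hedged sketch of the two speed-ups, assuming "partial uncertainty sampling" means scoring
# only a random subset of the pool and "larger query size" means labeling many items per round.
import numpy as np

def select_batch(score_fn, pool_size: int, subset_size: int, query_size: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    subset = rng.choice(pool_size, size=min(subset_size, pool_size), replace=False)
    scores = score_fn(subset)                    # uncertainty computed only on the sampled subset
    top = np.argsort(-scores)[:query_size]       # take a large batch in a single round
    return subset[top]

# Usage (hypothetical uncertainty array): select_batch(lambda idx: uncertainties[idx],
#                                                      pool_size=50_000, subset_size=5_000, query_size=500)
```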
arXiv Detail & Related papers (2020-06-17T14:51:11Z)