Towards Using Data-Centric Approach for Better Code Representation Learning
- URL: http://arxiv.org/abs/2205.13022v1
- Date: Wed, 25 May 2022 19:19:21 GMT
- Title: Towards Using Data-Centric Approach for Better Code Representation Learning
- Authors: Anh Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi Bui
- Abstract summary: We focus on improving existing code learning models from the data-centric perspective.
We use a so-called data-influence method to identify noisy samples of pre-trained code learning models.
- Score: 1.1470070927586016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent trend of creating source code models and applying them to software engineering tasks, the quality of such models is insufficient for real-world application. In this work, we focus on improving existing code learning models from the data-centric perspective instead of designing new source code models. We shed some light on this direction by using a so-called data-influence method to identify noisy samples of pre-trained code learning models. The data-influence method assesses the similarity of a target sample to the correct samples to determine whether or not the target sample is noisy. The results of our evaluation show that data-influence methods can identify noisy samples for the code classification and defect prediction tasks. We envision that the data-centric approach will be a key driver for developing source code models that are useful in practice.
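To make the idea concrete, here is a minimal sketch of a similarity-based noisy-sample check in the spirit of the abstract; the embedding source, the top-k aggregation, and the threshold are all illustrative assumptions, not the paper's method.
```python
# Flag a target sample as noisy if its representation is dissimilar to
# those of correctly classified ("correct") samples. Names are illustrative.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between a vector `a` and each row of matrix `b`."""
    a = a / (np.linalg.norm(a) + 1e-12)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    return b @ a

def flag_noisy(target_emb, correct_embs, k=10, threshold=0.3):
    """Noisy if the mean similarity to the k most similar correct samples
    falls below a threshold."""
    sims = cosine_sim(target_emb, correct_embs)
    topk = np.sort(sims)[-k:]
    return topk.mean() < threshold

# Toy usage: 100 correct samples with 64-dim embeddings, one target.
rng = np.random.default_rng(0)
correct = rng.normal(size=(100, 64))
target = rng.normal(size=64)
print(flag_noisy(target, correct))
```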
Related papers
- Evaluating Sample Utility for Data Selection by Mimicking Model Weights [12.056542160711718]
Foundation models are trained on large-scale web-crawled datasets, which often contain noise, biases, and irrelevant information.
We propose an efficient, model-based approach using the Mimic Score, a new data quality metric.
We develop Grad-Mimic, a framework that prioritizes samples for learning, creates effective filters, and automates data selection.
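One plausible reading of the Mimic Score, sketched below, rates a sample by how well its gradient step aligns with the direction from the current weights toward a stronger reference model; the alignment formula and all names here are assumptions inferred from the summary, not the paper's definition.
```python
# Hedged sketch of a Mimic-Score-style data quality metric.
import numpy as np

def mimic_score(sample_grad, current_w, reference_w):
    direction = reference_w - current_w      # where we want the weights to move
    step = -sample_grad                      # where this sample moves the weights
    denom = np.linalg.norm(direction) * np.linalg.norm(step) + 1e-12
    return float(step @ direction) / denom   # cosine alignment in [-1, 1]

# Toy usage: keep only samples whose update helps mimic the reference model.
rng = np.random.default_rng(1)
w, w_ref = rng.normal(size=8), rng.normal(size=8)
grads = rng.normal(size=(5, 8))
scores = [mimic_score(g, w, w_ref) for g in grads]
keep = [i for i, s in enumerate(scores) if s > 0.0]
```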
arXiv Detail & Related papers (2025-01-12T04:28:14Z)
- An Information Theoretic Approach to Machine Unlearning [45.600917449314444]
A key challenge in unlearning is forgetting the necessary data in a timely manner while preserving model performance.
In this work, we address the zero-shot unlearning scenario, whereby an unlearning algorithm must be able to remove data given only a trained model and the data to be forgotten.
We derive a simple but principled zero-shot unlearning method based on the geometry of the model.
arXiv Detail & Related papers (2024-02-02T13:33:30Z)
- Learn to Unlearn for Deep Neural Networks: Minimizing Unlearning Interference with Gradient Projection [56.292071534857946]
Recent data-privacy laws have sparked interest in machine unlearning.
The challenge is to discard information about the "forget" data without altering knowledge about the remaining dataset.
We adopt a projected-gradient-based learning method, named Projected-Gradient Unlearning (PGU).
We provide empirical evidence to demonstrate that our unlearning method can produce models that behave similarly to models retrained from scratch across various metrics, even when the training dataset is no longer accessible.
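The projection step at the heart of such a method can be sketched as follows; the SVD-based basis construction and the dimensions are illustrative choices, not the authors' implementation.
```python
# Gradient steps on the forget set are projected to be orthogonal to a
# subspace spanned by gradients on the retained data, limiting interference.
import numpy as np

def orthogonal_projection(grad, retained_basis):
    """Remove from `grad` its components along the retained-gradient
    subspace. `retained_basis` has orthonormal rows (e.g. from SVD)."""
    coeffs = retained_basis @ grad
    return grad - retained_basis.T @ coeffs

rng = np.random.default_rng(2)
retained_grads = rng.normal(size=(20, 50))   # gradients on retained data
# Orthonormal basis of the retained-gradient subspace via SVD.
_, _, vt = np.linalg.svd(retained_grads, full_matrices=False)
basis = vt[:10]                              # top 10 directions

forget_grad = rng.normal(size=50)            # gradient on the forget data
update = orthogonal_projection(forget_grad, basis)
# `update` now barely changes behavior on the retained data.
```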
arXiv Detail & Related papers (2023-12-07T07:17:24Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Strategies and impact of learning curve estimation for CNN-based image classification [0.2678472239880052]
Learning curves are a measure for how the performance of machine learning models improves given a certain volume of training data.
Across a wide variety of applications and models, it has been observed that learning curves largely follow a power-law behavior.
By estimating a model's learning curve from training on small subsets of the data, only the best models need to be considered for training on the full dataset.
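A minimal sketch of this estimate-then-extrapolate workflow, with synthetic subset sizes and error values standing in for real measurements:
```python
# Fit error ~ a * n**(-b) on small training subsets, then predict
# performance at the full dataset size.
import numpy as np

subset_sizes = np.array([500, 1000, 2000, 4000])
val_errors = np.array([0.42, 0.33, 0.26, 0.21])   # measured on subsets

# Fit log(err) = log(a) + b * log(n) by least squares.
b, log_a = np.polyfit(np.log(subset_sizes), np.log(val_errors), 1)
a = np.exp(log_a)

def predicted_error(n):
    return a * n ** b          # b is negative for a decaying learning curve

print(predicted_error(50_000)) # extrapolated error at the full dataset size
```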
arXiv Detail & Related papers (2023-10-12T16:28:25Z)
- Dual Student Networks for Data-Free Model Stealing [79.67498803845059]
The two main challenges are estimating the gradients of the target model without access to its parameters, and generating a diverse set of training samples.
We propose a Dual Student method where two students are symmetrically trained in order to provide the generator a criterion to generate samples that the two students disagree on.
We show that our new optimization framework provides more accurate gradient estimation of the target model and better accuracies on benchmark classification datasets.
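A toy sketch of the disagreement objective follows; the MLP architectures, input dimensions, and optimizer settings are placeholders, and only the generator update is shown (in the full method the students are also trained against the target model).
```python
# The generator learns to produce inputs on which two student networks
# disagree, giving a training signal without access to real data.
import torch
import torch.nn as nn

students = [nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
            for _ in range(2)]
generator = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(10):
    z = torch.randn(16, 32)                 # latent noise
    x = generator(z)                        # synthetic samples
    p1 = students[0](x).softmax(dim=-1)
    p2 = students[1](x).softmax(dim=-1)
    disagreement = (p1 - p2).abs().sum(dim=-1).mean()
    g_opt.zero_grad()
    (-disagreement).backward()              # generator maximizes disagreement
    g_opt.step()
```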
arXiv Detail & Related papers (2023-09-18T18:11:31Z)
- Stream-based active learning with linear models [0.7734726150561089]
In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data.
We propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner.
The iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points.
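A minimal sketch of such a thresholded query rule for a linear model; the variance-style informativeness score and the threshold value are illustrative assumptions, not the paper's exact strategy.
```python
# Each arriving point is labeled only if its informativeness (here, the
# predictive-variance quadratic form under ridge regression) exceeds a threshold.
import numpy as np

d, lam, threshold = 5, 1.0, 0.5
A = lam * np.eye(d)                 # regularized design matrix X^T X + lam*I
b = np.zeros(d)

rng = np.random.default_rng(3)
for _ in range(200):
    x = rng.normal(size=d)          # next unlabeled point from the stream
    informativeness = x @ np.linalg.solve(A, x)   # leverage-style score
    if informativeness > threshold:
        y = x @ np.ones(d) + rng.normal() * 0.1   # query the (toy) label
        A += np.outer(x, x)
        b += y * x

w = np.linalg.solve(A, b)           # ridge estimate from queried labels only
```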
arXiv Detail & Related papers (2022-07-20T13:15:23Z)
- Energy-bounded Learning for Robust Models of Code [16.592638312365164]
In programming, learning code representations has a variety of applications, including code classification, code search, comment generation, bug prediction, and so on.
We propose the use of an energy-bounded learning objective function to assign a higher score to in-distribution samples and a lower score to out-of-distribution samples in order to incorporate such out-of-distribution samples into the training process of source code models.
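The hinge-style margins below follow common energy-based out-of-distribution training; the paper's exact objective may differ, so treat this as a sketch of the general shape of an energy-bounded loss.
```python
# Cross-entropy plus a regularizer that pushes in-distribution energies low
# and out-of-distribution (OOD) energies high.
import torch
import torch.nn.functional as F

def energy(logits):
    return -torch.logsumexp(logits, dim=-1)   # low for in-distribution samples

def energy_bounded_loss(logits_in, labels_in, logits_out,
                        m_in=-5.0, m_out=-1.0, weight=0.1):
    ce = F.cross_entropy(logits_in, labels_in)
    e_in, e_out = energy(logits_in), energy(logits_out)
    # Push in-distribution energies below m_in, OOD energies above m_out.
    reg = (F.relu(e_in - m_in) ** 2).mean() + (F.relu(m_out - e_out) ** 2).mean()
    return ce + weight * reg

# Toy usage with random logits for 10 classes.
logits_in, logits_out = torch.randn(8, 10), torch.randn(8, 10)
labels_in = torch.randint(0, 10, (8,))
loss = energy_bounded_loss(logits_in, labels_in, logits_out)
```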
arXiv Detail & Related papers (2021-12-20T06:28:56Z)
- Distilling Interpretable Models into Human-Readable Code [71.11328360614479]
Human-readability is an important and desirable standard for machine-learned model interpretability.
We propose to train interpretable models using conventional methods, and then distill them into concise, human-readable code.
We describe a piecewise-linear curve-fitting algorithm that produces high-quality results efficiently and reliably across a broad range of use cases.
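A naive sketch of the fit-then-emit pipeline (breakpoints on a fixed grid, rather than the paper's efficient selection; the emitted-code format is an illustrative choice):
```python
# Fit a piecewise-linear curve to a learned 1-D function, then emit it as
# human-readable code.
import numpy as np

def fit_piecewise_linear(xs, ys, knots):
    """Return (knot, value) pairs by interpolating predictions at the knots."""
    return [(float(k), float(np.interp(k, xs, ys))) for k in knots]

def to_code(pieces, name="predict"):
    lines = [f"def {name}(x):"]
    for (x0, y0), (x1, y1) in zip(pieces, pieces[1:]):
        slope = (y1 - y0) / (x1 - x0)
        lines.append(f"    if x < {x1:.3g}:")
        lines.append(f"        return {y0:.3g} + {slope:.3g} * (x - {x0:.3g})")
    lines.append(f"    return {pieces[-1][1]:.3g}")
    return "\n".join(lines)

xs = np.linspace(0, 10, 200)
ys = np.sin(xs)                      # stand-in for a model's predictions
print(to_code(fit_piecewise_linear(xs, ys, knots=np.linspace(0, 10, 6))))
```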
arXiv Detail & Related papers (2021-01-21T01:46:36Z)
- Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning [109.74041512359476]
We study a number of design decisions for the predictive model in visual MBRL algorithms.
We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance.
We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks will perform the same as the best-performing models when trained on the same training data.
arXiv Detail & Related papers (2020-12-08T18:03:21Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
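Since the paper's $\rho$-gap is an analytical quantity tied to the control law, the sketch below only conveys the intuition with a crude nearest-neighbor density proxy; it is not the paper's measure.
```python
# Proxy for local training-data density around a query state: the distance
# to the k-th nearest training point. A large gap suggests sparse data and
# a weaker control-performance guarantee in that region.
import numpy as np

def knn_density_gap(train_states, query, k=5):
    dists = np.sort(np.linalg.norm(train_states - query, axis=1))
    return float(dists[k - 1])

rng = np.random.default_rng(4)
train_states = rng.uniform(-1, 1, size=(500, 2))
print(knn_density_gap(train_states, query=np.array([0.9, 0.9])))
```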
arXiv Detail & Related papers (2020-05-25T12:13:49Z)