Scaling Laws Behind Code Understanding Model
- URL: http://arxiv.org/abs/2402.12813v1
- Date: Tue, 20 Feb 2024 08:31:42 GMT
- Title: Scaling Laws Behind Code Understanding Model
- Authors: Jiayi Lin, Hande Dong, Yutao Xie, Lei Zhang
- Abstract summary: We study the scaling law for the code understanding task by varying training data, model size, and computing resources.
We train a large-scale code understanding model named CoLSBERT with 1.5B parameters on a large dataset using more computing resources, and it outperforms previous work by a large margin.
- Score: 4.846512516189021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scaling law is becoming a fundamental law in many machine learning areas:
test error falls off as a power law when training data, model size, and computing
resources increase. However, whether this law holds for the task of code understanding
is not well studied, and most current language models for code understanding have about
100M parameters, which is relatively "small" compared to large language models. In this
paper, we conduct extensive experiments to investigate the scaling law for the code
understanding task by varying training data, model size, and computing resources. We
validate that the test error of code understanding models falls off as a power law when
using larger models, indicating that the scaling law holds for the code understanding
task. Besides, we apply models of different scales to two downstream code understanding
tasks and find that performance increases with model scale. Finally, we train a
large-scale code understanding model named CoLSBERT with 1.5B parameters on a large
dataset using more computing resources, which outperforms previous work by a large
margin. We will release our code and the CoLSBERT model when our paper is published.
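As a rough illustration of the power-law relationship described in the abstract, the sketch below fits error ≈ a · N^(−α) to a handful of made-up (model size, test error) points via a log-log linear fit. The numbers and the 3B-parameter extrapolation target are placeholders for illustration, not measurements or procedures from the paper.

```python
# Minimal sketch of fitting a power law to (model size, test error) points.
# The data below are hypothetical placeholders, not the paper's measurements.
import numpy as np

sizes = np.array([125e6, 350e6, 760e6, 1.5e9])   # model size in parameters (hypothetical)
errors = np.array([0.42, 0.37, 0.33, 0.30])      # test error (hypothetical)

# A power law error = a * size**(-alpha) is a straight line in log-log space:
# log(error) = log(a) - alpha * log(size).
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
alpha, a = -slope, np.exp(intercept)
print(f"fitted exponent alpha ~= {alpha:.3f}")

# Extrapolate the fitted curve to a larger, hypothetical model size.
print(f"predicted test error at 3B parameters ~= {a * 3e9 ** -alpha:.3f}")
```

In practice such curves would be fit over the quantities the paper varies (data, parameters, compute), and the extrapolation used to judge whether training a larger model such as CoLSBERT is worthwhile.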
Related papers
- Bayesian scaling laws for in-context learning [72.17734205418502] (2024-10-21)
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523] (2024-10-15)
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125] (2024-03-13)
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For instance, scaling laws mostly predict loss, whereas models are usually compared on downstream task performance.
- Code Representation Learning At Scale [75.04686476303436] (2024-02-02)
We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masked language modeling and the structural aspect of programming languages.
We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
- A Dynamical Model of Neural Scaling Laws [79.59705237659547] (2024-02-02)
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
- Greener yet Powerful: Taming Large Code Generation Models with Quantization [47.734976584580224] (2023-03-09)
Large pretrained deep learning models have substantially pushed the boundary of code generation.
Despite their great power, the huge number of model parameters poses a significant challenge to adopting them in a regular software development environment.
Model compression is a promising approach to address these challenges.
- Reproducible scaling laws for contrastive language-image learning [42.354402731615444] (2022-12-14)
We investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository.
Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks.
We find that the training distribution plays a key role in scaling laws, as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures.
- Understanding Scaling Laws for Recommendation Models [1.6283945233720964] (2022-08-17)
We study empirical scaling laws for DLRM-style recommendation models, in particular Click-Through Rate (CTR) prediction.
We characterize scaling efficiency along three different resource dimensions, namely data, parameters, and compute.
We show that parameter scaling has run out of steam for the model architecture under study; until a higher-performing model architecture emerges, data scaling is the path forward.
- Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments [42.793379799720434] (2022-02-13)
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at fine-tuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
- Scaling Laws for Acoustic Models [7.906034575114518] (2021-06-11)
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
This list is automatically generated from the titles and abstracts of the papers on this site.