Scaling Laws for Code: A More Data-Hungry Regime
- URL: http://arxiv.org/abs/2510.08702v1
- Date: Thu, 09 Oct 2025 18:05:52 GMT
- Title: Scaling Laws for Code: A More Data-Hungry Regime
- Authors: Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che,
- Abstract summary: Scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). We conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B.
- Score: 43.20725601738161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code Large Language Models (LLMs) are revolutionizing software engineering. However, the scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios, but becomes a detriment at higher compute budgets.
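As a hedged illustration of the fitting procedure described in the abstract, the sketch below fits the standard Chinchilla parametric form L(N, D) = E + A/N^alpha + B/D^beta to (model size, token count, loss) triples and derives the compute-optimal tokens-per-parameter ratio, the quantity behind the "data-hungry" claim. It is a minimal sketch, not the authors' code; the data points, coefficients, and compute budget are synthetic placeholders.

```python
# Minimal sketch (not the paper's code) of fitting a Chinchilla-style loss surface
# L(N, D) = E + A / N**alpha + B / D**beta and reading off the compute-optimal
# tokens-per-parameter ratio. All data and coefficients are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(X, E, A, alpha, B, beta):
    """Irreducible term plus model-limited and data-limited terms."""
    N, D = X  # N: model parameters, D: training tokens
    return E + A / N**alpha + B / D**beta

# Hypothetical experimental grid; in practice these are measured final losses.
rng = np.random.default_rng(0)
N = np.tile([0.2e9, 0.5e9, 1.0e9, 2.0e9, 3.8e9], 4)
D = np.repeat([2e9, 8e9, 32e9, 128e9], 5)
true = (1.7, 400.0, 0.34, 410.0, 0.28)          # assumed ground-truth coefficients
L = chinchilla_loss((N, D), *true) + rng.normal(0.0, 0.01, N.size)

p0 = (2.0, 300.0, 0.3, 300.0, 0.3)              # rough starting point for the fit
(E, A, alpha, B, beta), _ = curve_fit(chinchilla_loss, (N, D), L, p0=p0, maxfev=50_000)

# Standard compute-optimal allocation under C ~= 6*N*D:
#   N_opt = G * (C/6)^(beta/(alpha+beta)),  D_opt = (1/G) * (C/6)^(alpha/(alpha+beta)),
#   with G = (alpha*A / (beta*B))^(1/(alpha+beta)).
C = 1e21                                        # example compute budget in FLOPs
G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
N_opt = G * (C / 6) ** (beta / (alpha + beta))
D_opt = (1.0 / G) * (C / 6) ** (alpha / (alpha + beta))
print(f"fitted: E={E:.2f}, alpha={alpha:.3f}, beta={beta:.3f}")
print(f"compute-optimal tokens per parameter at C={C:.0e}: {D_opt / N_opt:.1f}")
```

The same pipeline would be repeated with the Farseer parameterization, which the abstract reports as the more accurate of the two fits.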
Related papers
- Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA [50.494504099850325]
We introduce the Geodesic Hypothesis, positing that token sequences trace geodesics on a smooth semantic manifold and are therefore locally linear. We show this constraint improves signal-to-noise ratio and preserves diversity by preventing collisions along the trajectory. We demonstrate that geometric priors can surpass brute-force scaling.
arXiv Detail & Related papers (2026-02-26T04:45:07Z) - Towards Robust Scaling Laws for Optimizers [89.21160945066737]
Empirical scaling laws are widely used to predict loss as model size and training data grow. We show that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
arXiv Detail & Related papers (2026-02-07T21:40:33Z) - Scaling Laws for Code: Every Programming Language Matters [73.6302896079007]
Code large language models (Code LLMs) are powerful but costly to train. Different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance. We present the first systematic exploration of scaling laws for multilingual code pre-training.
arXiv Detail & Related papers (2025-12-15T16:07:34Z) - Relative Scaling Laws for LLMs [91.73497548097775]
Scaling laws describe how language models improve with additional data, parameters, and compute. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale. These results show that although scaling improves overall performance, it is not a universal equalizer.
arXiv Detail & Related papers (2025-10-28T16:55:22Z) - Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets [5.8465717270452195]
We show how scaling law derivation can be used for model and dataset comparison. For the first time, full scaling laws are derived for two important language-vision learning procedures, CLIP and MaMMUT. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule.
arXiv Detail & Related papers (2025-06-05T03:35:59Z) - Bayesian scaling laws for in-context learning [85.34114399339741]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates. We show that ICL approximates a Bayesian learner, which gives rise to a novel Bayesian scaling law for ICL. Our scaling law matches existing scaling laws in accuracy while also offering interpretable terms for task priors, learning efficiency, and per-example probabilities.
arXiv Detail & Related papers (2024-10-21T21:45:22Z) - LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods [3.333401582174629]
We apply scaling laws to intrinsically characterize LLM-generated natural language (LLMNL) and human natural language (HNL).
Through experiments, we reveal slight deviations from Mandelbrot's law in LLMNL, underscore a complexity advantage in HNL, and supplement an interpretive discussion on language style.
We introduce a novel data augmentation method for few-shot text classification, termed ZGPTDA, which leverages fuzzy computing mechanisms driven by the conformity to scaling laws.
arXiv Detail & Related papers (2024-06-29T05:40:17Z) - Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
arXiv Detail & Related papers (2024-02-20T13:56:38Z) - Scaling Laws Behind Code Understanding Model [4.846512516189021]
We study the scaling law for the code understanding task by varying training data, model size, and computing resource.
We train a large-scale code understanding model named CoLSBERT with 1.5B parameters on a large dataset using more computing resource, which outperforms previous work by a large margin.
arXiv Detail & Related papers (2024-02-20T08:31:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.