Fugu-MT paper translation (abstract): Scaling Laws for Code: A More Data-Hungry Regime

Paper overview: Scaling Laws for Code: A More Data-Hungry Regime

arxiv url: http://arxiv.org/abs/2510.08702v1
Date: Thu, 09 Oct 2025 18:05:52 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:47.400091
Title: Scaling Laws for Code: A More Data-Hungry Regime
Title (reference translation): Scaling Laws for Code: A More Data-Hungry Regime
Authors: Xianzhen Luo, Wenzhen Zheng, Qingfu Zhu, Rongyi Zhang, Houyi Li, Siming Huang, YuanTao Fan, Wanxiang Che
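Abstract summary: Scaling laws that guide efficient training have mainly been analyzed on natural language (NL). This large-scale empirical study of scaling laws for code comprises 117 experimental runs, with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B.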
Reference score (independently computed attention measure): 43.20725601738161
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code Large Language Models (LLMs) are revolutionizing software engineering. However, the scaling laws that guide efficient training have predominantly been analyzed on Natural Language (NL). Given fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws are directly applicable to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios but becomes detrimental at higher compute budgets.
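Abstract (reference translation): Code large language models (LLMs) are revolutionizing software engineering. However, the scaling laws that guide efficient training have mainly been analyzed on natural language (NL). Given fundamental differences between code and NL, such as strict syntax, it is unclear whether these laws apply directly to code. To address this gap, we conduct a large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B and training tokens from 2B to 128B. We fit the Chinchilla law and the Farseer law. First, we show that the more expressive Farseer law is more accurate. Second, Code LLMs scale effectively with model size. Importantly, code represents a more data-hungry regime that requires a much higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL helps in resource-constrained scenarios but becomes harmful at higher compute budgets.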
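As a rough illustration of the "data-to-parameter ratio" mentioned in the abstract, the sketch below uses the commonly cited Chinchilla parametric form L(N, D) = E + A / N^alpha + B / D^beta and the common approximation C = 6 * N * D for training compute. The class name `ChinchillaFit`, the coefficient values, and the compute budget are all hypothetical placeholders chosen for illustration; they are not the coefficients fitted in this paper, and the Farseer law is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ChinchillaFit:
    """Coefficients of L(N, D) = E + A / N**alpha + B / D**beta (hypothetical container)."""
    E: float
    A: float
    B: float
    alpha: float
    beta: float

    def loss(self, n_params: float, n_tokens: float) -> float:
        """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
        return self.E + self.A / n_params**self.alpha + self.B / n_tokens**self.beta

    def compute_optimal(self, flops: float) -> tuple[float, float]:
        """Minimize the loss subject to flops = 6 * N * D (assumed approximation).

        Setting dL/dN = 0 with D = flops / (6 * N) gives
        N_opt = G * (flops / 6) ** (beta / (alpha + beta)),
        where G = (alpha * A / (beta * B)) ** (1 / (alpha + beta)).
        """
        a, b = self.alpha, self.beta
        g = (a * self.A / (b * self.B)) ** (1.0 / (a + b))
        n_opt = g * (flops / 6.0) ** (b / (a + b))
        d_opt = (flops / 6.0) / n_opt
        return n_opt, d_opt


# Placeholder coefficients for illustration only; NOT values reported in the paper.
fit = ChinchillaFit(E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28)
n_opt, d_opt = fit.compute_optimal(1e21)
print(f"N_opt={n_opt:.3e}  D_opt={d_opt:.3e}  tokens per parameter={d_opt / n_opt:.1f}")
```

Under this form, a "more data-hungry" regime corresponds to fitted coefficients for which the optimal D / N ratio at a given compute budget is larger.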

Related papers