Fugu-MT paper translation (abstract): Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

Paper overview: Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks

arxiv url: http://arxiv.org/abs/2605.10395v1
Date: Mon, 11 May 2026 11:39:03 GMT
Status: translation complete
Last updated in system: 2026-05-12 23:28:50.772732
Title: Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
Title (reference translation): Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
Authors: Minh-Toan Nguyen, Jean Barbier
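Abstract summary: We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries. We show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws.
Reference score (independently computed attention score): 8.250374560598493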
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $n^{1/(2\beta)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $\beta>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.
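Abstract (reference translation): In the context of knowledge transfer to a smaller student model, we study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$. We derive asymptotically sharp characterizations of the Bayes-optimal generalization error and of the individual feature overlaps through a system of closed fixed-point equations. These equations show that feature learnability is governed by a sequence of sharp phase transitions: as the amount of data grows, the teacher features become recoverable one after another, each through a discontinuous jump in overlap. This sequential acquisition gives a precise definition of the effective width $k_c$, the number of learnable features at a given data budget $n$. It unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $n^{1/(2\beta)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $\beta>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and we provide an information-theoretic account of the associated scaling in model size.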
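
The two scaling laws quoted in the abstract can be compared against the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$ with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the value of $\beta$ is an arbitrary example, and the implied growth of $k_c$ with $n$ is our own inference obtained by holding $d$ fixed and ignoring constants; the abstract does not state it.

```python
def feature_learning_exponent(beta: float) -> float:
    """Exponent of n in the feature-learning regime: eps ~ n^(1/(2*beta) - 1)."""
    if beta <= 0.5:
        raise ValueError("the abstract assumes beta > 1/2")
    return 1.0 / (2.0 * beta) - 1.0


# Exponent of n in the refinement regime: eps ~ n^(-1).
REFINEMENT_EXPONENT = -1.0


def implied_kc_exponent(eps_exponent: float) -> float:
    """Exponent of n in k_c implied by eps = Theta(k_c * d / n).

    Assumption (ours, not the abstract's): d is held fixed, so
    k_c ~ eps * n / d ~ n^(eps_exponent + 1).
    """
    return eps_exponent + 1.0


beta = 1.0  # hypothetical example value; any beta > 1/2 is allowed
fl = feature_learning_exponent(beta)
print("feature-learning: eps ~ n^%g, k_c ~ n^%g" % (fl, implied_kc_exponent(fl)))
print("refinement:       eps ~ n^%g, k_c ~ n^%g"
      % (REFINEMENT_EXPONENT, implied_kc_exponent(REFINEMENT_EXPONENT)))
```

Under these assumptions the effective width grows like $n^{1/(2\beta)}$ while features are still being acquired and stops growing with $n$ in the refinement regime, which is consistent with both laws collapsing to the same relation.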

Related papers