Learning Curve Theory
- URL: http://arxiv.org/abs/2102.04074v1
- Date: Mon, 8 Feb 2021 09:25:31 GMT
- Title: Learning Curve Theory
- Authors: Marcus Hutter
- Abstract summary: 'Scaling laws' refers to power-law decreases of training or test error w.r.t. more data, larger neural networks, and/or more compute.
We develop and theoretically analyse the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves for arbitrary power $\beta>0$.
- Score: 21.574781022415365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently a number of empirical "universal" scaling law papers have been
published, most notably by OpenAI. 'Scaling laws' refers to power-law decreases
of training or test error w.r.t. more data, larger neural networks, and/or more
compute. In this work we focus on scaling w.r.t. data size $n$. Theoretical
understanding of this phenomenon is largely lacking, except in
finite-dimensional models for which error typically decreases with $n^{-1/2}$
or $n^{-1}$, where $n$ is the sample size. We develop and theoretically analyse
the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves
for arbitrary power $\beta>0$, and determine whether power laws are universal
or depend on the data distribution.
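As a quick illustration of the $n^{-\beta}$ learning curves discussed above, here is a minimal sketch that simulates a memorizing learner on Zipf-distributed data and estimates the exponent $\beta$ with a log-log least-squares fit. The Zipf generator and all parameter values are illustrative assumptions, not necessarily the paper's exact construction.

```python
# Minimal sketch: estimate a learning-curve exponent beta from (n, error) pairs
# by least squares in log-log space, where error(n) ~ C * n^{-beta}.
# The Zipf-memorization setup below is an illustrative assumption,
# not necessarily the paper's exact toy model.
import numpy as np

rng = np.random.default_rng(0)

def zipf_probs(num_items: int, exponent: float) -> np.ndarray:
    """Zipf-like distribution over a finite discrete domain."""
    w = np.arange(1, num_items + 1, dtype=float) ** (-exponent)
    return w / w.sum()

def memorization_error(n: int, probs: np.ndarray, trials: int = 20) -> float:
    """Expected test error of a learner that memorizes the items seen in a
    sample of size n: the probability mass of still-unseen items."""
    errs = []
    for _ in range(trials):
        seen = np.unique(rng.choice(len(probs), size=n, p=probs))
        errs.append(1.0 - probs[seen].sum())
    return float(np.mean(errs))

probs = zipf_probs(num_items=100_000, exponent=1.5)  # illustrative Zipf exponent
ns = np.array([100, 300, 1_000, 3_000, 10_000])
errors = np.array([memorization_error(n, probs) for n in ns])

# Fit log(error) = log C - beta * log n; the slope gives -beta.
slope, _ = np.polyfit(np.log(ns), np.log(errors), deg=1)
print(f"estimated beta ~= {-slope:.2f}")
```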
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$, where $M$ is the model size and $N$ is the data size.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For example, scaling laws mostly predict loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- A Neural Scaling Law from Lottery Ticket Ensembling [19.937894875216507]
Sharma & Kaplan predicted that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension.
We find that a simple 1D problem manifests a different scaling law ($\alpha=1$) from their predictions.
arXiv Detail & Related papers (2023-10-03T17:58:33Z)
- Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
- Neural Implicit Manifold Learning for Topology-Aware Density Estimation [15.878635603835063]
Current generative models learn the manifold $\mathcal{M}$ by mapping an $m$-dimensional latent variable through a neural network.
We show that our model can learn manifold-supported distributions with complex topologies more accurately than pushforward models.
arXiv Detail & Related papers (2022-06-22T18:00:00Z)
- $p$-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z)
- Locality defeats the curse of dimensionality in convolutional teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z)
- Estimating Stochastic Linear Combination of Non-linear Regressions Efficiently and Scalably [23.372021234032363]
We show that when the sub-sample sizes are large, too much estimation accuracy is sacrificed.
To the best of our knowledge, this is the first work that provides theoretical guarantees for the stochastic linear combination of non-linear regressions model.
arXiv Detail & Related papers (2020-10-19T07:15:38Z)
- The Information Bottleneck Problem and Its Applications in Machine Learning [53.57797720793437]
Inference capabilities of machine learning systems have skyrocketed in recent years, now playing a pivotal role in various aspects of society.
The information bottleneck (IB) theory emerged as a bold information-theoretic paradigm for analyzing deep learning (DL) systems.
In this tutorial we survey the information-theoretic origins of this abstract principle, and its recent impact on DL.
arXiv Detail & Related papers (2020-04-30T16:48:51Z)
- A Neural Scaling Law from the Dimension of the Data Manifold [8.656787568717252]
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$.
The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$.
This simple theory predicts scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses (see the numerical sketch after this list).
arXiv Detail & Related papers (2020-04-22T19:16:06Z)
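As a worked illustration of the $\alpha \approx 4/d$ relation summarized in the data-manifold entry above, here is a minimal sketch that fits $\alpha$ from $(N, L)$ pairs and reads off the implied intrinsic dimension $d \approx 4/\alpha$. The numbers are synthetic placeholders, not results from any of the listed papers.

```python
# Minimal sketch: given (N, L) measurements where loss L ~ N^{-alpha}, estimate
# alpha with a log-log least-squares fit and read off the implied intrinsic
# dimension d ~= 4/alpha from the manifold-dimension theory summarized above.
# The (N, L) values below are synthetic placeholders, not results from any paper.
import numpy as np

N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])   # parameter counts (synthetic)
L = 2.5 * N ** (-0.095)                   # synthetic losses generated with alpha = 0.095

slope, _ = np.polyfit(np.log(N), np.log(L), deg=1)
alpha = -slope
print(f"alpha ~= {alpha:.3f}  ->  implied intrinsic dimension d ~= 4/alpha ~= {4 / alpha:.1f}")
```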
This list is automatically generated from the titles and abstracts of the papers on this site.