Learning Curve Theory
- URL: http://arxiv.org/abs/2102.04074v1
- Date: Mon, 8 Feb 2021 09:25:31 GMT
- Title: Learning Curve Theory
- Authors: Marcus Hutter
- Abstract summary: 'Scaling laws' refers to power-law decreases of training or test error w.r.t. more data, larger neural networks, and/or more compute.
We develop and theoretically analyse the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves for arbitrary power $\beta>0$.
- Score: 21.574781022415365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently a number of empirical "universal" scaling law papers have been
published, most notably by OpenAI. 'Scaling laws' refers to power-law decreases
of training or test error w.r.t. more data, larger neural networks, and/or more
compute. In this work we focus on scaling w.r.t. data size $n$. Theoretical
understanding of this phenomenon is largely lacking, except in
finite-dimensional models for which error typically decreases with $n^{-1/2}$
or $n^{-1}$, where $n$ is the sample size. We develop and theoretically analyse
the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves
for arbitrary power $\beta>0$, and determine whether power laws are universal
or depend on the data distribution.
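As a quick illustration of the $n^{-\beta}$ learning curves discussed above, here is a minimal sketch that simulates a memorizing learner on Zipf-distributed data and estimates the exponent $\beta$ with a log-log least-squares fit. The Zipf generator and all parameter values are illustrative assumptions, not necessarily the paper's exact construction.

```python
# Minimal sketch: estimate a learning-curve exponent beta from (n, error) pairs
# by least squares in log-log space, where error(n) ~ C * n^{-beta}.
# The Zipf-memorization setup below is an illustrative assumption,
# not necessarily the paper's exact toy model.
import numpy as np

rng = np.random.default_rng(0)

def zipf_probs(num_items: int, exponent: float) -> np.ndarray:
    """Zipf-like distribution over a finite discrete domain."""
    w = np.arange(1, num_items + 1, dtype=float) ** (-exponent)
    return w / w.sum()

def memorization_error(n: int, probs: np.ndarray, trials: int = 20) -> float:
    """Expected test error of a learner that memorizes the items seen in a
    sample of size n: the probability mass of still-unseen items."""
    errs = []
    for _ in range(trials):
        seen = np.unique(rng.choice(len(probs), size=n, p=probs))
        errs.append(1.0 - probs[seen].sum())
    return float(np.mean(errs))

probs = zipf_probs(num_items=100_000, exponent=1.5)  # illustrative Zipf exponent
ns = np.array([100, 300, 1_000, 3_000, 10_000])
errors = np.array([memorization_error(n, probs) for n in ns])

# Fit log(error) = log C - beta * log n; the slope gives -beta.
slope, _ = np.polyfit(np.log(ns), np.log(errors), deg=1)
print(f"estimated beta ~= {-slope:.2f}")
```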
Related papers
- Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$, where $M$ is the model size and $N$ is the data size.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For example, scaling laws mostly predict loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- A Neural Scaling Law from Lottery Ticket Ensembling [19.937894875216507]
Sharma & Kaplan predicted that MSE losses decay as $N^{-\alpha}$, $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension.
We find that a simple 1D problem manifests a different scaling law ($\alpha=1$) from their predictions.
arXiv Detail & Related papers (2023-10-03T17:58:33Z)
- Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures.
This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and that the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notion -- effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
- Neural Implicit Manifold Learning for Topology-Aware Density Estimation [15.878635603835063]
Current generative models learn the manifold $\mathcal{M}$ by mapping an $m$-dimensional latent variable through a neural network.
We show that our model can learn manifold-supported distributions with complex topologies more accurately than pushforward models.
arXiv Detail & Related papers (2022-06-22T18:00:00Z)
- $p$-Generalized Probit Regression and Scalable Maximum Likelihood Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+\varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z)
- Locality defeats the curse of dimensionality in convolutional teacher-student scenarios [69.2027612631023]
We show that locality is key in determining the learning curve exponent $\beta$.
We conclude by proving, using a natural assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
arXiv Detail & Related papers (2021-06-16T08:27:31Z)
- Estimating Stochastic Linear Combination of Non-linear Regressions Efficiently and Scalably [23.372021234032363]
We show that when the sub-sample sizes are large, too much estimation accuracy is sacrificed.
To the best of our knowledge, this is the first work that provides theoretical guarantees for the stochastic linear combination of non-linear regressions model.
arXiv Detail & Related papers (2020-10-19T07:15:38Z)
- The Information Bottleneck Problem and Its Applications in Machine Learning [53.57797720793437]
Inference capabilities of machine learning systems have skyrocketed in recent years, now playing a pivotal role in various aspects of society.
The information bottleneck (IB) theory emerged as a bold information-theoretic paradigm for analyzing deep learning (DL) systems.
In this tutorial we survey the information-theoretic origins of this abstract principle, and its recent impact on DL.
arXiv Detail & Related papers (2020-04-30T16:48:51Z)
- A Neural Scaling Law from the Dimension of the Data Manifold [8.656787568717252]
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$.
The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$.
This simple theory predicts scaling exponents $\alpha \approx 4/d$ for cross-entropy and mean-squared error losses (see the numerical sketch after this list).
arXiv Detail & Related papers (2020-04-22T19:16:06Z)
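As a worked illustration of the $\alpha \approx 4/d$ relation summarized in the data-manifold entry above, here is a minimal sketch that fits $\alpha$ from $(N, L)$ pairs and reads off the implied intrinsic dimension $d \approx 4/\alpha$. The numbers are synthetic placeholders, not results from any of the listed papers.

```python
# Minimal sketch: given (N, L) measurements where loss L ~ N^{-alpha}, estimate
# alpha with a log-log least-squares fit and read off the implied intrinsic
# dimension d ~= 4/alpha from the manifold-dimension theory summarized above.
# The (N, L) values below are synthetic placeholders, not results from any paper.
import numpy as np

N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])   # parameter counts (synthetic)
L = 2.5 * N ** (-0.095)                   # synthetic losses generated with alpha = 0.095

slope, _ = np.polyfit(np.log(N), np.log(L), deg=1)
alpha = -slope
print(f"alpha ~= {alpha:.3f}  ->  implied intrinsic dimension d ~= 4/alpha ~= {4 / alpha:.1f}")
```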
This list is automatically generated from the titles and abstracts of the papers on this site.