Two Phases of Scaling Laws for Nearest Neighbor Classifiers
- URL: http://arxiv.org/abs/2308.08247v1
- Date: Wed, 16 Aug 2023 09:28:55 GMT
- Title: Two Phases of Scaling Laws for Nearest Neighbor Classifiers
- Authors: Pengkun Yang, Jingzhao Zhang
- Abstract summary: A fast scaling law implies that one can solve machine learning problems by simply boosting the data and the model sizes.
We show that a scaling law can have two phases: in the first phase, the generalization error depends polynomially on the data dimension and decreases fast; in the second phase, it depends exponentially on the data dimension and decreases slowly.
- Score: 18.93620861346151
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A scaling law refers to the observation that the test performance of a model
improves as the amount of training data increases. A fast scaling law implies
that one can solve machine learning problems by simply boosting the data and
the model sizes. Yet, in many cases, the benefit of adding more data can be
negligible. In this work, we study the rate of scaling laws of nearest neighbor
classifiers. We show that a scaling law can have two phases: in the first
phase, the generalization error depends polynomially on the data dimension and
decreases fast; whereas in the second phase, the error depends exponentially on
the data dimension and decreases slowly. Our analysis highlights the complexity
of the data distribution in determining the generalization error. When the data
is benignly distributed, our result suggests that the nearest neighbor classifier
can achieve a generalization error that depends polynomially, rather than
exponentially, on the data dimension.
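The two-phase picture can be probed with a quick simulation. Below is a minimal sketch, not the authors' experiment: it estimates the test error of a 1-nearest-neighbor classifier as the training set grows, on a synthetic two-Gaussian problem in a few dimensions; the data generator and all parameter choices are illustrative assumptions.

```python
# Minimal illustrative sketch (not the paper's experiment): test error of a
# 1-nearest-neighbor classifier vs. training-set size in several dimensions,
# on a synthetic two-Gaussian problem separated along the first coordinate.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sample(n, d, margin=1.0):
    """Draw n labelled points in R^d; the class shifts the first coordinate."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d))
    x[:, 0] += margin * (2 * y - 1)
    return x, y

n_test = 2000
for d in (2, 8, 32):
    for n_train in (100, 1_000, 10_000):
        x_tr, y_tr = sample(n_train, d)
        x_te, y_te = sample(n_test, d)
        err = 1.0 - KNeighborsClassifier(n_neighbors=1).fit(x_tr, y_tr).score(x_te, y_te)
        print(f"d={d:3d}  n={n_train:6d}  test error={err:.3f}")
```

On a benign distribution like this one the error curves flatten quickly; the abstract attributes the slow, exponential-in-dimension second phase to less benign distributions.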
Related papers
- Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime.
We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function.
We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
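For reference, the classical Freedman martingale inequality, of which the paper develops a fully empirical variant (the form below is the textbook statement, not the paper's bound): for a martingale difference sequence $(X_i)$ with $X_i \le b$ almost surely,
$\Pr\big(\sum_{i=1}^{n} X_i \ge t,\ \sum_{i=1}^{n} \mathbb{E}[X_i^2 \mid \mathcal{F}_{i-1}] \le v\big) \le \exp\big(-t^2/(2(v + bt/3))\big)$.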
arXiv Detail & Related papers (2025-04-21T16:26:56Z) - Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data [4.481230230086981]
In deep neural networks, a model's generalization error is often observed to follow a power scaling law dependent both on the model size and the data size.
We show that our theory predicts a power law between the generalization error and both the training data size and the network size for transformers.
By leveraging low-dimensional data structures under a manifold hypothesis, we are able to explain transformer scaling laws in a way which respects the data geometry.
arXiv Detail & Related papers (2024-11-11T01:05:28Z) - Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws.
We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z) - Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
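Reading $M$ as the model size and $N$ as the data size (the usual convention for this setup), the rate splits into a model-limited and a data-limited term; the tiny sketch below, with the hidden $\Theta(\cdot)$ constants set to one and arbitrary example values of $a$, $M$, $N$, just shows which term dominates.

```python
# Illustrative only: evaluate the two terms of the reducible-error rate
# Theta(M^(-(a-1)) + N^(-(a-1)/a)) with the hidden constants set to 1;
# a, M, N below are arbitrary example values, not the paper's.
def reducible_error(M, N, a=2.0):
    model_term = M ** (-(a - 1))      # bottlenecked by model size M
    data_term = N ** (-(a - 1) / a)   # bottlenecked by data size N
    return model_term, data_term

for M, N in [(10**2, 10**6), (10**4, 10**6), (10**6, 10**6)]:
    m_term, d_term = reducible_error(M, N)
    print(f"M={M:>8}, N={N:>8}: model term {m_term:.1e}, data term {d_term:.1e}")
```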
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z) - Precise Learning Curves and Higher-Order Scaling Limits for Dot Product
Kernel Regression [41.48538038768993]
We focus on the problem of kernel ridge regression for dot-product kernels.
We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales.
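Taking $d$ as the input dimension and $m$ as the number of samples, the predicted peak locations grow very quickly with $r$; a small illustrative sketch (with an arbitrary choice of $d$) lists the first few.

```python
# Illustrative: sample sizes m ~ d^r / r! at which learning-curve peaks are
# predicted for dot-product kernel regression; d below is an arbitrary choice.
from math import factorial

d = 100  # assumed input dimension, for illustration only
for r in range(1, 5):
    print(f"r={r}: m ~ {d**r / factorial(r):.3g}")
```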
arXiv Detail & Related papers (2022-05-30T04:21:31Z) - Shape complexity in cluster analysis [0.0]
In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters.
Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering.
We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach.
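For contrast with the proposed shape-based factors, the conventional first step mentioned above is plain per-feature scaling; a minimal sketch of that baseline (standardization followed by k-means, not the paper's method) might look like this.

```python
# Baseline shown only for contrast: conventional per-feature standardization
# before k-means, not the shape-complexity scaling proposed in the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two clusters whose second feature has a much larger raw scale.
x = np.vstack([rng.normal([0, 0], [1, 50], size=(100, 2)),
               rng.normal([5, 0], [1, 50], size=(100, 2))])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(StandardScaler().fit_transform(x))
print(np.bincount(labels))   # roughly 100 / 100 once the features are rescaled
```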
arXiv Detail & Related papers (2022-05-17T01:33:15Z) - Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
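A data scaling exponent of the kind discussed here is typically read off as the (negative) slope of a log-log fit of test loss against training-set size; the sketch below uses made-up numbers purely for illustration.

```python
# Illustrative only: estimate a data-scaling exponent as the (negative) slope
# of a log-log fit of test loss vs. training-set size; the numbers are made up.
import numpy as np

n_train = np.array([1e5, 1e6, 1e7, 1e8])
test_loss = np.array([4.1, 3.2, 2.5, 2.0])            # hypothetical losses
slope, _ = np.polyfit(np.log(n_train), np.log(test_loss), 1)
print(f"estimated data-scaling exponent: {-slope:.3f}")
```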
arXiv Detail & Related papers (2022-02-04T06:53:49Z) - Toeplitz Least Squares Problems, Fast Algorithms and Big Data [1.3535770763481905]
Two recent algorithms have applied randomized numerical linear algebra techniques to fitting an autoregressive model to big time-series data.
We investigate and compare the quality of these two approximation algorithms on large-scale synthetic and real-world data.
While both algorithms display comparable results for synthetic datasets, the LSAR algorithm appears to be more robust when applied to real-world time series data.
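Both algorithms approximate the same underlying problem: fitting an AR($p$) model by least squares over a lagged, Toeplitz-structured design matrix. A minimal exact baseline (not the randomized LSAR algorithm itself) is sketched below.

```python
# Illustrative exact baseline (not the randomized LSAR algorithm): fit an
# AR(p) model by least squares over the lagged (Toeplitz-structured) design.
import numpy as np

def fit_ar_least_squares(x, p):
    """Return AR(p) coefficients minimizing ||X w - y||_2."""
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    y = x[p:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

rng = np.random.default_rng(0)
x = np.zeros(5000)
for t in range(2, len(x)):                    # simulate a stationary AR(2) process
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(fit_ar_least_squares(x, p=2))           # ~ [0.6, -0.3]
```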
arXiv Detail & Related papers (2021-12-24T08:32:09Z) - Information-Theoretic Generalization Bounds for Iterative
Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the generalization error of iterative SSL algorithms using information-theoretic principles.
Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z) - SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
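The memory constraint comes from the $n \times n$ kernel matrix that a batch KRR solver materializes; the generic in-memory Gaussian-kernel KRR sketch below (not StreaMRAK itself) makes the quadratic footprint explicit.

```python
# Generic in-memory KRR (not StreaMRAK): the n x n kernel matrix is what
# forces all data into main memory and motivates a streaming variant.
import numpy as np

def krr_fit_predict(x_train, y_train, x_test, lam=1e-3, bandwidth=1.0):
    sq = lambda a, b: ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq(x_train, x_train) / (2 * bandwidth**2))   # O(n^2) memory
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    K_test = np.exp(-sq(x_test, x_train) / (2 * bandwidth**2))
    return K_test @ alpha

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=500)
x_new = np.linspace(-3, 3, 5).reshape(-1, 1)
print(krr_fit_predict(x, y, x_new))
```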
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - Interpolation and Learning with Scale Dependent Kernels [91.41836461193488]
We study the learning properties of nonparametric ridge-less least squares.
We consider the common case of estimators defined by scale dependent kernels.
arXiv Detail & Related papers (2020-06-17T16:43:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.