Two Phases of Scaling Laws for Nearest Neighbor Classifiers
- URL: http://arxiv.org/abs/2308.08247v1
- Date: Wed, 16 Aug 2023 09:28:55 GMT
- Title: Two Phases of Scaling Laws for Nearest Neighbor Classifiers
- Authors: Pengkun Yang, Jingzhao Zhang
- Abstract summary: A fast scaling law implies that one can solve machine learning problems by simply boosting the data and the model sizes.
We show that a scaling law can have two phases: in the first phase, the generalization error depends polynomially on the data dimension and decreases fast, whereas in the second phase it depends exponentially on the data dimension and decreases slowly.
- Score: 18.93620861346151
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A scaling law refers to the observation that the test performance of a model
improves as the number of training data increases. A fast scaling law implies
that one can solve machine learning problems by simply boosting the data and
the model sizes. Yet, in many cases, the benefit of adding more data can be
negligible. In this work, we study the rate of scaling laws of nearest neighbor
classifiers. We show that a scaling law can have two phases: in the first
phase, the generalization error depends polynomially on the data dimension and
decreases fast; whereas in the second phase, the error depends exponentially on
the data dimension and decreases slowly. Our analysis highlights the complexity
of the data distribution in determining the generalization error. When the data
is benignly distributed, our result suggests that the nearest neighbor classifier
can achieve a generalization error that depends polynomially, instead of
exponentially, on the data dimension.
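To make the scaling-law statement concrete, the minimal sketch below traces the test error of a 1-nearest-neighbor classifier as the training set grows. This is an illustrative assumption on my part, not the paper's experiment: it uses NumPy and scikit-learn (neither is mentioned in the paper), and the synthetic two-Gaussian data and all parameter choices are hypothetical.

```python
# Minimal sketch (assumed setup, not from the paper): empirically trace the
# error-vs-sample-size curve of a 1-nearest-neighbor classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
d = 10          # data dimension (illustrative choice)
n_test = 5000   # held-out test size (illustrative choice)

def sample(n):
    """Two Gaussian classes separated along the first coordinate."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d))
    x[:, 0] += 2.0 * y   # the mean shift controls how "benign" the problem is
    return x, y

x_test, y_test = sample(n_test)

for n_train in [100, 1_000, 10_000, 100_000]:
    x_train, y_train = sample(n_train)
    clf = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)
    err = 1.0 - clf.score(x_test, y_test)   # test error at this sample size
    print(f"n = {n_train:>7d}  test error = {err:.4f}")
```

On well-separated data of this kind one would expect the measured error to drop quickly at small sample sizes and then flatten, which is the qualitative two-phase picture the abstract describes.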
Related papers
- Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime.
We introduce a novel, fully empirical Freedman-type martingale concentration that tightens existing bounds by accounting for the variance of the loss function.
We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
arXiv Detail & Related papers (2025-04-21T16:26:56Z) - A statistical theory of overfitting for imbalanced classification [0.6144680854063939]
We develop a statistical theory for high-dimensional imbalanced classification.
We find that dimensionality induces truncation or skewing effects on the logit distribution.
This phenomenon explains why the minority class is more severely affected by overfitting.
arXiv Detail & Related papers (2025-02-17T00:21:33Z) - Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data [4.481230230086981]
In deep neural networks, a model's generalization error is often observed to follow a power scaling law dependent both on the model size and the data size.
We show that our theory predicts a power law between the generalization error and both the training data size and the network size for transformers.
By leveraging low-dimensional data structures under a manifold hypothesis, we are able to explain transformer scaling laws in a way which respects the data geometry.
arXiv Detail & Related papers (2024-11-11T01:05:28Z) - Information-Theoretic Foundations for Neural Scaling Laws [20.617552198581024]
We develop information-theoretic foundations for neural scaling laws.
We observe that the optimal relation between data and model size is linear, up to logarithmic factors.
arXiv Detail & Related papers (2024-06-28T02:20:54Z) - Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.
We show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$.
Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z) - Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z) - Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z) - How Informative is the Approximation Error from Tensor Decomposition for
Neural Network Compression? [7.358732518242147]
Recent work compresses multiple layers and fine-tunes the compressed model under the assumption that the approximation error on the weights is a proxy for the model's performance.
We perform an experimental study to test if this assumption holds across different layers and types of decompositions, and what the effect of fine-tuning is.
We find the approximation error on the weights has a positive correlation with the performance error, before as well as after fine-tuning.
arXiv Detail & Related papers (2023-05-09T10:12:26Z) - Variance estimation in graphs with the fused lasso [7.732474038706013]
We develop a linear time estimator for the homoscedastic case that can consistently estimate the variance in general graphs.
We show that our estimator attains minimax rates for the chain and 2D grid graphs when the mean signal has total variation with canonical scaling.
arXiv Detail & Related papers (2022-07-26T03:50:51Z) - Precise Learning Curves and Higher-Order Scaling Limits for Dot Product
Kernel Regression [41.48538038768993]
We focus on the problem of kernel ridge regression for dot-product kernels.
We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales.
arXiv Detail & Related papers (2022-05-30T04:21:31Z) - Shape complexity in cluster analysis [0.0]
In cluster analysis, a common first step is to scale the data aiming to better partition them into clusters.
Here we explore the use of multidimensional shapes of data, aiming to obtain scaling factors for use prior to clustering.
We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach.
arXiv Detail & Related papers (2022-05-17T01:33:15Z) - Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z) - Toeplitz Least Squares Problems, Fast Algorithms and Big Data [1.3535770763481905]
Two recent algorithms have applied randomized numerical linear algebra techniques to fitting an autoregressive model to big time-series data.
We investigate and compare the quality of these two approximation algorithms on large-scale synthetic and real-world data.
While both algorithms display comparable results for synthetic datasets, the LSAR algorithm appears to be more robust when applied to real-world time series data.
arXiv Detail & Related papers (2021-12-24T08:32:09Z) - Information-Theoretic Generalization Bounds for Iterative
Semi-Supervised Learning [81.1071978288003]
In particular, we seek to understand the behaviour of the generalization error of iterative SSL algorithms using information-theoretic principles.
Our theoretical results suggest that when the class conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates.
arXiv Detail & Related papers (2021-10-03T05:38:49Z) - SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Matrix Completion with Quantified Uncertainty through Low Rank Gaussian
Copula [30.84155327760468]
This paper proposes a framework for missing value imputation with quantified uncertainty.
The time required to fit the model scales linearly with the number of rows and the number of columns in the dataset.
Empirical results show the method yields state-of-the-art imputation accuracy across a wide range of data types.
arXiv Detail & Related papers (2020-06-18T19:51:42Z) - Interpolation and Learning with Scale Dependent Kernels [91.41836461193488]
We study the learning properties of nonparametric ridge-less least squares.
We consider the common case of estimators defined by scale dependent kernels.
arXiv Detail & Related papers (2020-06-17T16:43:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.