Related papers: A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws

A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws

URL: http://arxiv.org/abs/2510.00504v1
Date: Wed, 01 Oct 2025 04:35:23 GMT
Title: A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws
Authors: Hong-Yi Wang, Di Luo, Tomaso Poggio, Isaac L. Chuang, Liu Ziyin,
Abstract summary: We show that a neural network can be compressed to polylogarithmic width while preserving its learning dynamics.<n>We also show that a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged.
Score: 16.542320113452423
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the \textit{dynamical} lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L\sim d^{-\alpha}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.

Related papers

Unified Scaling Laws for Compressed Representations [69.72517034565467]
We investigate whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations.<n>Our main finding is demonstrating both theoretically and empirically that there exists a simple "capacity" metric.<n>We extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
arXiv Detail & Related papers (2025-06-02T16:52:51Z)
Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification [50.717692060500696]
Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling.<n>Next-token prediction can be made robust so as to achieve $C=tilde O(H)$, representing moderate error amplification.<n>No computationally efficient algorithm can achieve sub-polynomial approximation factor $C=e(log H)1-Omega(1)$.
arXiv Detail & Related papers (2025-02-18T02:52:00Z)
Scaling Laws in Linear Regression: Compute, Parameters, and Data [86.48154162485712]
We study the theory of scaling laws in an infinite dimensional linear regression setup.<n>We show that the reducible part of the test error is $Theta(-(a-1) + N-(a-1)/a)$.<n>Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
arXiv Detail & Related papers (2024-06-12T17:53:29Z)
Distribution learning via neural differential equations: a nonparametric statistical perspective [1.4436965372953483]
This work establishes the first general statistical convergence analysis for distribution learning via ODE models trained through likelihood transformations. We show that the latter can be quantified via the $C1$-metric entropy of the class $mathcal F$. We then apply this general framework to the setting of $Ck$-smooth target densities, and establish nearly minimax-optimal convergence rates for two relevant velocity field classes $mathcal F$: $Ck$ functions and neural networks.
arXiv Detail & Related papers (2023-09-03T00:21:37Z)
Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories [70.90012822736988]
Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to intrinsic data structures. This paper introduces a relaxed assumption that input data are concentrated around a subset of $mathbbRd$ denoted by $mathcalS$, and the intrinsic dimension $mathcalS$ can be characterized by a new complexity notation -- effective Minkowski dimension.
arXiv Detail & Related papers (2023-06-26T17:13:31Z)
Exponentially improved efficient machine learning for quantum many-body states with provable guarantees [0.0]
We provide theoretical guarantees for efficient learning of quantum many-body states and their properties, with model-independent applications. Our results provide theoretical guarantees for efficient learning of quantum many-body states and their properties, with model-independent applications.
arXiv Detail & Related papers (2023-04-10T02:22:36Z)
Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability. We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z)
A Universal Law of Robustness via Isoperimetry [1.484852576248587]
We show that smooth requires $d$ more parameters than mere, where $d$ is the ambient data dimension. We prove this universal law of robustness for any smoothly parametrized function class with size weights.
arXiv Detail & Related papers (2021-05-26T19:49:47Z)
On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning. In particular, we prove that the complexity of the function class $mathcalF$ characterizes the complexity of the function. Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z)
Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model. For loss factors, we prove that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)
A Neural Scaling Law from the Dimension of the Data Manifold [8.656787568717252]
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L propto N-alpha$ in the number of network parameters $N$. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $alpha approx 4/d$ for cross-entropy and mean-squared error losses.
arXiv Detail & Related papers (2020-04-22T19:16:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.