Why ReLU? A Bit-Model Dichotomy for Deep Network Training
- URL: http://arxiv.org/abs/2602.19017v1
- Date: Sun, 22 Feb 2026 03:12:44 GMT
- Title: Why ReLU? A Bit-Model Dichotomy for Deep Network Training
- Authors: Ilan Doron-Arad, Elchanan Mossel
- Abstract summary: We show that finite-precision constraints are not merely implementation details but fundamental determinants of learnability.
- Score: 13.35672617666336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Theoretical analyses of Empirical Risk Minimization (ERM) are typically framed within the Real-RAM model of computation. In this setting, training even simple neural networks is known to be $\exists \mathbb{R}$-complete -- a complexity class believed to be harder than NP, which characterizes the difficulty of solving systems of polynomial inequalities over the real numbers. However, this algebraic framework diverges from the reality of digital computation on finite-precision hardware. In this work, we analyze the theoretical complexity of ERM under a realistic bit-level model ($\mathsf{ERM}_{\text{bit}}$), where network parameters and inputs are constrained to be rational numbers with polynomially bounded bit-lengths. Under this model, we reveal a sharp dichotomy in tractability governed by the network's activation function. We prove that for deep networks with {\em any} polynomial activation with rational coefficients and degree at least $2$, the bit-complexity of training is severe: deciding $\mathsf{ERM}_{\text{bit}}$ is $\#P$-hard, and hence believed to be strictly harder than NP-complete problems. Furthermore, we show that determining the sign of a single partial derivative of the empirical loss function is intractable (unlikely to be in BPP), and deciding a specific bit of the gradient is $\#P$-hard. This provides a complexity-theoretic perspective on the phenomenon of exploding and vanishing gradients. In contrast, we show that for piecewise-linear activations such as ReLU, the precision requirements remain manageable: $\mathsf{ERM}_{\text{bit}}$ is in NP (and in fact NP-complete), and standard backpropagation runs in polynomial time. Our results demonstrate that finite-precision constraints are not merely implementation details but fundamental determinants of learnability.
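To see the mechanism behind this dichotomy concretely, here is a minimal sketch in exact rational arithmetic (our illustration, not code from the paper; the input value, the affine coefficients, and the depth are arbitrary choices). A degree-2 activation roughly doubles the bit-length of the running value at every layer, while a ReLU chain grows it only additively:

```python
from fractions import Fraction

def bit_length(q: Fraction) -> int:
    """Bits needed to write q as a reduced fraction."""
    return q.numerator.bit_length() + q.denominator.bit_length()

x = Fraction(3, 7)  # a rational input with small bit-length

# Degree-2 polynomial activation: each layer squares its value, so the
# bit-length roughly doubles per layer -- exponential growth in depth.
poly = x
for depth in range(1, 11):
    poly = poly * poly
    print(f"square  depth {depth:2d}: {bit_length(poly):5d} bits")

# ReLU applied to a fixed rational affine layer: the bit-length grows
# only additively per layer, i.e. polynomially in the depth.
a, b = Fraction(5, 3), Fraction(-1, 4)
relu = x
for depth in range(1, 11):
    relu = max(Fraction(0), a * relu + b)
    print(f"ReLU    depth {depth:2d}: {bit_length(relu):5d} bits")
```

After ten layers the squaring chain already needs several thousand bits while the ReLU chain stays within a few dozen; this is the precision gap separating the $\#P$-hard and NP-complete sides of the dichotomy.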
Related papers
- Parameterized Hardness of Zonotope Containment and Neural Network Verification [9.076330553662876]
We prove that deciding positivity of a function $f\colon \mathbb{R}^d \to \mathbb{R}$ computed by a 2-layer ReLU network is W[1]-hard when parameterized by $d$. We also show that approximating the maximum within any multiplicative factor in 2-layer ReLU networks, computing the $L_p$-Lipschitz constant for $p \in (0,\infty)$ in 2-layer networks, and approximating the $L_p$-Lipschitz constant in 3-layer networks are NP-hard (a brute-force baseline for the positivity problem is sketched after this entry).
arXiv Detail & Related papers (2025-09-26T18:59:59Z)
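For intuition about the regime this hardness result lives in, here is a naive exact decision procedure for the positivity problem (our illustration, not the paper's reduction; `is_positive_everywhere` and the toy instance are hypothetical). It enumerates all $2^k$ activation patterns of a 2-layer ReLU network $f(x) = \sum_i a_i\,\mathrm{ReLU}(w_i \cdot x + b_i) + c$ and solves one linear program per linear region, so it runs in time exponential in the width $k$:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def is_positive_everywhere(W, b, a, c=0.0):
    """Check f(x) = sum_i a[i] * relu(W[i] @ x + b[i]) + c > 0 for all x by
    enumerating the 2**k activation patterns (exponential in the width k)."""
    k, d = W.shape
    for s in itertools.product([0, 1], repeat=k):
        s = np.array(s, dtype=float)
        # On this region f is linear: g(x) = (s * a) @ (W @ x + b) + c.
        obj = (s * a) @ W
        const = (s * a) @ b + c
        # Region: W[i] @ x + b[i] >= 0 iff s[i] == 1, rewritten as A x <= rhs.
        signs = np.where(s == 1, -1.0, 1.0)
        A_ub = signs[:, None] * W
        b_ub = -signs * b
        res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d)
        if res.status == 3:          # g unbounded below on a nonempty region
            return False
        if res.status == 0 and res.fun + const <= 0:
            return False             # f attains a nonpositive value
        # res.status == 2 means the region is empty; nothing to check.
    return True

# Toy instance: f(x) = relu(x) + relu(-x) + 1 = |x| + 1 > 0 everywhere.
W = np.array([[1.0], [-1.0]])
print(is_positive_everywhere(W, b=np.zeros(2), a=np.ones(2), c=1.0))  # True
```

The W[1]-hardness statement above says that, under standard assumptions, even the exponent's dependence on the input dimension $d$ cannot be removed.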
- Expressive Power of Deep Networks on Manifolds: Simultaneous Approximation [2.815765641180636]
We show that a constant-depth $\mathrm{ReLU}^{k-1}$ network with bounded weights can approximate any function in the Sobolev space. We also prove that our construction is nearly optimal by showing that the required number of parameters matches known lower bounds up to a logarithmic factor.
arXiv Detail & Related papers (2025-09-11T11:28:20Z)
- The Limits of Tractable Marginalization [23.716205079188608]
Marginalization -- summing a function over all assignments to a subset of its inputs -- is a fundamental computational problem. We show that whenever there is an efficient real RAM performing virtual evidence marginalization for a function, there are small circuits for that function's multilinear representation (a toy numerical illustration follows this entry).
arXiv Detail & Related papers (2025-04-17T07:54:56Z)
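To make "virtual evidence marginalization" and "multilinear representation" concrete, the toy sketch below (our illustration, not the paper's construction; the Boolean function and the marginals are arbitrary) computes the same expectation two ways: by brute-force summation over all assignments, and by evaluating the function's multilinear representation at the marginal probabilities:

```python
import itertools
import math

n = 3
f = lambda x: int(x[0] and (x[1] or not x[2]))  # arbitrary toy Boolean function
p = [0.9, 0.5, 0.2]                             # virtual evidence: P(x_i = 1)

# Brute-force marginalization: E[f(X)] under independent Bernoulli(p_i) inputs.
brute = sum(
    f(x) * math.prod(pi if xi else 1 - pi for xi, pi in zip(x, p))
    for x in itertools.product([0, 1], repeat=n)
)

# Multilinear representation: f(x) = sum_S c_S * prod_{i in S} x_i on {0,1}^n,
# with coefficients c_S recovered by Moebius inversion over subsets.
coeff = {}
for S in itertools.product([0, 1], repeat=n):
    coeff[S] = sum(
        (-1) ** (sum(S) - sum(T)) * f(T)
        for T in itertools.product([0, 1], repeat=n)
        if all(t <= s for t, s in zip(T, S))
    )

# Evaluating the polynomial at the marginals performs the marginalization.
poly = sum(
    c * math.prod(p[i] for i in range(n) if S[i]) for S, c in coeff.items()
)
print(brute, poly)  # the two values agree
```

The paper's result runs this connection in reverse: an efficient real-RAM marginalizer forces the multilinear representation to have small circuits.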
- The Sample Complexity of Online Reinforcement Learning: A Multi-model Perspective [55.15192437680943]
We study the sample complexity of online reinforcement learning in the general setting of nonlinear dynamical systems with continuous state and action spaces. Our algorithm achieves a policy regret of $\mathcal{O}(N\epsilon^2 + \ln(m(\epsilon))/\epsilon^2)$, where $N$ is the time horizon. In the special case where the dynamics are parametrized by a compact and real-valued set of parameters, we prove a square-root policy regret bound.
arXiv Detail & Related papers (2025-01-27T10:01:28Z)
- Neural Networks and (Virtual) Extended Formulations [8.185918509343818]
We prove lower bounds on the size of neural networks that optimize over a polytope $P$. We show that the extension complexity $\mathrm{xc}(P)$ is a lower bound on the size of any monotone or input-convex neural network that solves the linear optimization problem over $P$.
arXiv Detail & Related papers (2024-11-05T11:12:11Z)
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$. We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by information exponents (a minimal training sketch follows this entry).
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
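As a concrete rendering of this setting, the sketch below (a minimal illustration; the paper's actual algorithm, link function, and hyperparameters are not specified here, so these choices are ours) trains a two-layer ReLU student with online SGD on fresh Gaussian samples from a planted single-index target:

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, steps, lr = 16, 64, 20_000, 0.01  # arbitrary illustrative sizes

# Planted target f_*(x) = sigma_*(<x, theta>) with an illustrative link
# function sigma_* (the second Hermite polynomial).
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
sigma_star = lambda z: z ** 2 - 1.0

# Two-layer ReLU student: f(x) = a @ relu(W @ x).
W = rng.standard_normal((width, d)) / np.sqrt(d)
a = rng.standard_normal(width) / np.sqrt(width)

for _ in range(steps):
    x = rng.standard_normal(d)               # one fresh sample per step
    h = np.maximum(W @ x, 0.0)
    err = a @ h - sigma_star(x @ theta)      # residual of the squared loss
    grad_a = err * h
    grad_W = err * np.outer(a * (h > 0.0), x)
    a -= lr * grad_a
    W -= lr * grad_W

# Held-out mean squared error after training.
X = rng.standard_normal((512, d))
pred = np.maximum(X @ W.T, 0.0) @ a
print("test MSE:", np.mean((pred - sigma_star(X @ theta)) ** 2))
```

The paper's contribution is the sample-complexity rate; the sketch only fixes the setting, not the rate.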
- Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z)
- Average-Case Complexity of Tensor Decomposition for Low-Degree Polynomials [93.59919600451487]
"Statistical-computational gaps" occur in many statistical inference tasks.
We consider a model for random order-3 tensor decomposition where one component is slightly larger in norm than the rest.
We show that low-degree polynomials of the tensor entries can accurately estimate the largest component when $r \ll n^{3/2}$ but fail to do so when $r \gg n^{3/2}$ (a small planted-recovery sketch follows this entry).
arXiv Detail & Related papers (2022-11-10T00:40:37Z)
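The planted model is easy to simulate in the regime $r \ll n^{3/2}$. The sketch below (our illustration, not the paper's estimator; sizes, strengths, and the iteration count are arbitrary) plants one slightly larger component in a random order-3 tensor and recovers it with plain tensor power iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 50, 10  # r is far below n**1.5, the easy regime

# Random unit components; the first one is slightly larger in norm.
components = rng.standard_normal((r, n))
components /= np.linalg.norm(components, axis=1, keepdims=True)
strengths = np.ones(r)
strengths[0] = 2.0

# Order-3 tensor T = sum_k strengths[k] * c_k (x) c_k (x) c_k.
T = np.einsum("k,ki,kj,kl->ijl", strengths, components, components, components)

# Tensor power iteration: u <- T(u, u, .) normalized.  From a random start
# it typically converges to the heaviest component (not guaranteed per run).
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
for _ in range(100):
    u = np.einsum("ijl,i,j->l", T, u, u)
    u /= np.linalg.norm(u)

print("overlap with planted component:", abs(u @ components[0]))
```

The paper's dichotomy predicts that estimators captured by low-degree polynomials of the tensor entries stop working once $r \gg n^{3/2}$.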
- Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers [50.85524803885483]
This work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical learnability.
We study SM approximation for two function classes: circuits and Turing machines.
arXiv Detail & Related papers (2021-07-28T04:28:55Z)
- On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of states.
arXiv Detail & Related papers (2020-11-09T18:32:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.