Provable Benefits of Sinusoidal Activation for Modular Addition
- URL: http://arxiv.org/abs/2511.23443v1
- Date: Fri, 28 Nov 2025 18:37:03 GMT
- Title: Provable Benefits of Sinusoidal Activation for Modular Addition
- Authors: Tianlong Huang, Zhiyuan Li
- Abstract summary: We first establish a sharp expressivity gap: sine MLPs admit width-$2$ exact realizations for any fixed length $m$ and, with bias, width-$2$ exact realizations uniformly over all lengths. We then provide a novel Natarajan-dimension generalization bound for sine networks, yielding nearly optimal sample complexity $\widetilde{\mathcal{O}}(p)$ for ERM over constant-width sine networks. We also derive width-independent, margin-based generalization for sine networks in the overparametrized regime and validate it.
- Score: 6.836203507099085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies the role of activation functions in learning modular addition with two-layer neural networks. We first establish a sharp expressivity gap: sine MLPs admit width-$2$ exact realizations for any fixed length $m$ and, with bias, width-$2$ exact realizations uniformly over all lengths. In contrast, the width of ReLU networks must scale linearly with $m$ to interpolate, and they cannot simultaneously fit two lengths with different residues modulo $p$. We then provide a novel Natarajan-dimension generalization bound for sine networks, yielding nearly optimal sample complexity $\widetilde{\mathcal{O}}(p)$ for ERM over constant-width sine networks. We also derive width-independent, margin-based generalization for sine networks in the overparametrized regime and validate it. Empirically, sine networks generalize consistently better than ReLU networks across regimes and exhibit strong length extrapolation.
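To make the width-$2$ expressivity claim concrete, here is a minimal NumPy sketch of one natural width-2 sine construction for modular addition; the paper's exact construction may differ. Both hidden units read the raw sum $s = \sum_i x_i$, a bias of $\pi/2$ turns the second unit into a cosine, and the $p$ output logits equal $\cos(2\pi(s-k)/p)$, which is uniquely maximized at $k = s \bmod p$. Note that the weights do not depend on the length $m$, consistent with the uniform-over-lengths claim for sine networks with bias.

```python
import numpy as np

# Hypothetical parameters for illustration: modulus p, sequence length m.
p, m = 7, 5

def width2_sine_net(x, p):
    """Width-2 sine network computing (x_1 + ... + x_m) mod p.

    Hidden layer: two sine units sharing the all-ones input weight
    vector (pre-activation s = sum(x)), with phases 0 and pi/2.
    Output layer: p logits with readout weights
    (sin(2*pi*k/p), cos(2*pi*k/p)), so that
    logit_k = cos(2*pi*(s - k)/p), maximized exactly at k = s mod p.
    """
    s = np.sum(x)  # shared pre-activation of both hidden units
    h = np.array([np.sin(2 * np.pi * s / p),               # sine unit
                  np.sin(2 * np.pi * s / p + np.pi / 2)])  # cosine via bias
    W = np.stack([np.sin(2 * np.pi * np.arange(p) / p),
                  np.cos(2 * np.pi * np.arange(p) / p)])   # (2, p) readout
    logits = h @ W  # logit_k = cos(2*pi*(s - k)/p)
    return int(np.argmax(logits))

rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.integers(0, p, size=m)
    assert width2_sine_net(x, p) == x.sum() % p
print("width-2 sine network matches modular addition exactly")
```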
Related papers
- Constructive Universal Approximation and Finite Sample Memorization by Narrow Deep ReLU Networks [0.0]
We show that any dataset with $N$ distinct points in $\mathbb{R}^d$ and $M$ output classes can be exactly classified. We also prove a universal approximation theorem in $L^p(\Omega; \mathbb{R}^m)$ for any bounded domain. Our results offer a unified and interpretable framework connecting controllability, expressivity, and training dynamics in deep neural networks.
arXiv Detail & Related papers (2024-09-10T14:31:21Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Depth Separation in Norm-Bounded Infinite-Width Neural Networks [55.21840159087921]
We study depth separation in infinite-width neural networks, where complexity is controlled by the overall squared $\ell_2$-norm of the weights.
We show that there are functions that are learnable with sample complexity polynomial in the input dimension by norm-controlled depth-3 ReLU networks, yet are not learnable with sub-exponential sample complexity by norm-controlled depth-2 ReLU networks.
arXiv Detail & Related papers (2024-02-13T21:26:38Z) - Universal approximation with complex-valued deep narrow neural networks [0.0]
We study the universality of complex-valued neural networks with bounded widths and arbitrary depths.
We show that deep narrow complex-valued networks are universal if and only if their activation function is neither holomorphic, nor antiholomorphic, nor $\mathbb{R}$-affine.
arXiv Detail & Related papers (2023-05-26T13:22:14Z) - The Onset of Variance-Limited Behavior for Networks in the Lazy and Rich
Regimes [75.59720049837459]
We study the transition from infinite-width behavior to this variance-limited regime as a function of sample size $P$ and network width $N$.
We find that finite-size effects can become relevant for very small datasets on the order of $P^* \sim \sqrt{N}$ for regression with ReLU networks.
arXiv Detail & Related papers (2022-12-23T04:48:04Z) - On the Universal Approximation Property of Deep Fully Convolutional
Neural Networks [15.716533830931766]
We prove that deep residual fully convolutional networks and their continuous-layer counterpart can achieve universal approximation of symmetric functions at constant channel width.
We show that these requirements are necessary, in the sense that networks with fewer channels or smaller kernels fail to be universal approximators.
arXiv Detail & Related papers (2022-11-25T12:02:57Z) - On the Effective Number of Linear Regions in Shallow Univariate ReLU
Networks: Convergence Guarantees and Implicit Bias [50.84569563188485]
We show that gradient flow converges in direction when labels are determined by the sign of a target network with $r$ neurons.
Our result may already hold for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
arXiv Detail & Related papers (2022-05-18T16:57:10Z) - Minimum Width for Universal Approximation [91.02689252671291]
We prove that the minimum width required for the universal approximation of the $L^p$ functions is exactly $\max\{d_x+1, d_y\}$.
We also prove that the same conclusion does not hold for the uniform approximation with ReLU, but does hold with an additional threshold activation function.
arXiv Detail & Related papers (2020-06-16T01:24:21Z) - Neural Networks are Convex Regularizers: Exact Polynomial-time Convex
Optimization Formulations for Two-layer Networks [70.15611146583068]
We develop exact representations of training two-layer neural networks with rectified linear units (ReLUs).
Our theory utilizes semi-infinite duality and minimum norm regularization.
arXiv Detail & Related papers (2020-02-24T21:32:41Z)
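As a rough illustration of that last idea, the sketch below sets up the flavor of such a convex reformulation in cvxpy, following the general recipe of fixing ReLU activation patterns $D_i$ and minimizing a group-regularized convex objective under sign constraints. Enumerating every pattern of the hyperplane arrangement yields the exact program; the random pattern sampling, toy data, `beta`, and pattern count used here are hypothetical simplifications, not the paper's exact formulation.

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data: n samples, d features.
rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
beta = 1e-3  # weight-decay (group-norm) strength

# Sample ReLU activation patterns D_i = diag(1[X g >= 0]); enumerating
# all patterns gives the exact convex program, sampling a subset is a
# common tractable relaxation.
G = rng.standard_normal((d, 30))
patterns = np.unique((X @ G >= 0).astype(float).T, axis=0)

V = [cp.Variable(d) for _ in patterns]  # positive-part weights
W = [cp.Variable(d) for _ in patterns]  # negative-part weights
pred = sum(cp.multiply(di, X @ (v - w))
           for di, v, w in zip(patterns, V, W))
reg = beta * sum(cp.norm(v, 2) + cp.norm(w, 2) for v, w in zip(V, W))

# Sign constraints (2 D_i - I) X u >= 0 keep each weight vector
# consistent with its assigned activation pattern.
cons = []
for di, v, w in zip(patterns, V, W):
    SX = np.diag(2 * di - 1) @ X
    cons += [SX @ v >= 0, SX @ w >= 0]

prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(pred - y) + reg), cons)
prob.solve()
print("convex objective:", prob.value)
```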
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.