Related papers: Divine Benevolence is an $x^2$: GLUs scale asymptotically faster than MLPs

URL: http://arxiv.org/abs/2602.14495v2
Date: Tue, 24 Feb 2026 07:58:02 GMT
Title: Divine Benevolence is an $x^2$: GLUs scale asymptotically faster than MLPs
Authors: Alejandro Francisco Queiruga
Abstract summary: Scaling laws can be understood from ground-up numerical analysis. GLU variants now dominate frontier LLMs and similar outer-product architectures are prevalent in ranking models. We show that GLUs have piecewise quadratic functional forms that are sufficient to exhibit quadratic order of approximation. This opens the possibility of architecture design from first principles numerical theory to unlock superior scaling in large models.
Score: 51.56484100374058
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling laws can be understood from ground-up numerical analysis, where traditional function approximation theory can explain shifts in model architecture choices. GLU variants now dominate frontier LLMs and similar outer-product architectures are prevalent in ranking models. The success of these architectures has mostly been left as an empirical discovery. In this paper, we apply the tools of numerical analysis to expose a key factor: these models have an $x^2$ which enables asymptotically faster scaling than MLPs. GLUs have piecewise quadratic functional forms that are sufficient to exhibit quadratic order of approximation. Our key contribution is to demonstrate that the $L(P)$ scaling slope is $L(P)\propto P^{-3}$ for GLUs but only $L(P)\propto P^{-2}$ for MLPs on function reconstruction problems. We provide a parameter construction and empirical verification of these slopes for 1D function approximation. From the first principles we discover, we make one stride and propose the "Gated Quadratic Unit" which has an even steeper $L(P)$ slope than the GLU and MLP. This opens the possibility of architecture design from first principles numerical theory to unlock superior scaling in large models. Replication code is available at https://github.com/afqueiruga/divine_scaling.
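
The "piecewise quadratic" claim in the abstract can be illustrated with a minimal sketch. It is not taken from the replication repository. It assumes a ReLU gate (the paper may use other gate activations), and the function names and weight values are hypothetical. A single 1D ReLU unit has zero curvature away from its kink, whereas a single gated unit has constant nonzero curvature wherever the gate is active.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_unit(x, w=1.5, b=-0.3):
    """One ReLU hidden unit in 1D: piecewise linear in x (illustrative weights)."""
    return relu(w * x + b)

def glu_unit(x, w_gate=1.5, b_gate=-0.3, w_val=2.0, b_val=0.5):
    """One ReLU-gated unit in 1D (assumed gate): relu(gate) * value.

    On the active side this is a product of two affine functions of x,
    so it is quadratic there and identically zero on the inactive side.
    """
    return relu(w_gate * x + b_gate) * (w_val * x + b_val)

def second_difference(f, x, h=1e-2):
    """Central second difference; exact for quadratics up to rounding."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

# The gate kink sits at x = 0.2, so these grids stay clear of it.
x_active = np.linspace(1.0, 2.0, 5)
x_inactive = np.linspace(-2.0, -1.0, 5)

mlp_curv = second_difference(mlp_unit, x_active)        # no curvature
glu_curv = second_difference(glu_unit, x_active)        # 2 * w_gate * w_val
glu_curv_off = second_difference(glu_unit, x_inactive)  # gate closed

print("MLP unit curvature (active side):", mlp_curv)
print("GLU unit curvature (active side):", glu_curv)
print("GLU unit curvature (inactive side):", glu_curv_off)
```

The sketch only shows the functional form of one unit. It does not reproduce the $L(P)$ slopes reported in the abstract, which depend on the paper's parameter construction and loss definition.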

Related papers

Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis [54.57279006229212]
Information exponent has played an important role in predicting the sample complexity of online gradient descent. In this work, we show that by considering both second- and higher-order terms, we can first learn the relevant space using the second-order terms. The overall sample and complexity of online SGD is $\tilde{O}(d P^{L-1})$.
arXiv Detail & Related papers (2024-10-13T00:14:08Z)
PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs). We show that PanGu-$\pi$-7B can achieve a comparable performance to that of benchmarks with about 10% inference speed-up. In addition, we have deployed PanGu-$\pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z)
Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles [39.10180309328293]
In this paper we revisit linear MDPs from the perspective of feature selection. Our main result is the first polynomial-time algorithm for this problem. We show that they do exist and can be computed efficiently via convex programming.
arXiv Detail & Related papers (2023-09-18T03:35:48Z)
Restricted Strong Convexity of Deep Learning Models with Smooth Activations [31.003601717265006]
We study the problem of optimization of deep learning models with smooth activation functions. We introduce a new analysis of optimization based on Restricted Strong Convexity (RSC). Ours is the first result on establishing geometric convergence of GD based on RSC for deep learning models.
arXiv Detail & Related papers (2022-09-29T21:24:26Z)
Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences. Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
Model Selection with Near Optimal Rates for Reinforcement Learning with General Model Classes [27.361399036211694]
We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem. In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels. We show that \texttt{ARL-GEN} obtains a regret of $\tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2 + \sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$.
arXiv Detail & Related papers (2021-07-13T05:00:38Z)
Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs). We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
arXiv Detail & Related papers (2020-06-23T17:08:54Z)
Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation. We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure. Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
