Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
- URL: http://arxiv.org/abs/2411.03884v1
- Date: Wed, 06 Nov 2024 13:00:34 GMT
- Title: Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
- Authors: Zhijian Zhuo, Ya Wang, Yutao Zeng, Xiaoqing Li, Xun Zhou, Jinwen Ma
- Abstract summary: Transformers have found extensive applications across various domains due to their powerful fitting capabilities.
In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers.
- Score: 18.02269664279028
- Abstract: Transformers have found extensive applications across various domains due to their powerful fitting capabilities. This success can be partially attributed to their inherent nonlinearity. Thus, in addition to the ReLU function employed in the original transformer architecture, researchers have explored alternative modules such as GeLU and SwishGLU to enhance nonlinearity and thereby augment representational capacity. In this paper, we propose a novel category of polynomial composition activations (PolyCom), designed to optimize the dynamics of transformers. Theoretically, we provide a comprehensive mathematical analysis of PolyCom, highlighting its enhanced expressivity and efficacy relative to other activation functions. Notably, we demonstrate that networks incorporating PolyCom achieve the optimal approximation rate, indicating that PolyCom networks require minimal parameters to approximate general smooth functions in Sobolev spaces. We conduct empirical experiments on the pre-training configurations of large language models (LLMs), including both dense and sparse architectures. By substituting conventional activation functions with PolyCom, we enable LLMs to capture higher-order interactions within the data, thus improving performance metrics in terms of accuracy and convergence rates. Extensive experimental results demonstrate the effectiveness of our method, showing substantial improvements over other activation functions. Code is available at https://github.com/BryceZhuo/PolyCom.
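To make "polynomial composition" concrete, the following minimal sketch composes a polynomial with ReLU. The abstract does not give the exact functional form, so the form used here (a weighted sum of powers of ReLU(x)), the function name `poly_relu`, and the coefficient values are illustrative assumptions. In a real model the coefficients would be learned rather than fixed.

```python
# Illustrative sketch only. The functional form (a polynomial in ReLU(x)) and the
# coefficient values are assumptions for illustration; they are not taken from
# the abstract above.

def relu(x: float) -> float:
    """Standard rectified linear unit."""
    return x if x > 0.0 else 0.0


def poly_relu(x: float, coeffs: list[float]) -> float:
    """Return sum_i coeffs[i] * relu(x)**i, where coeffs[0] is the constant term."""
    r = relu(x)
    total = 0.0
    power = 1.0  # relu(x)**0
    for a in coeffs:
        total += a * power
        power *= r
    return total


# Hypothetical third-order coefficients; a trained network would learn these.
coeffs = [0.0, 1.0, 0.5, 0.25]

print(poly_relu(2.0, coeffs))       # 2 + 0.5*4 + 0.25*8
print(poly_relu(-1.0, coeffs))      # negative inputs leave only the constant term
print(poly_relu(1.5, [0.0, 1.0]))   # first-order case reduces to plain ReLU
```

With coefficients `[0.0, 1.0]` the function is exactly ReLU, so the higher-order terms can be read as additional capacity layered on top of the conventional activation.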
Related papers
- PolyCL: Contrastive Learning for Polymer Representation Learning via Explicit and Implicit Augmentations [1.7695773264807546]
We present a self-supervised contrastive learning paradigm, PolyCL, for learning high-quality polymer representation without the need for labels.
Our model combines explicit and implicit augmentation strategies for improved learning performance.
arXiv Detail & Related papers (2024-08-14T13:43:22Z)
- Chebyshev Polynomial-Based Kolmogorov-Arnold Networks: An Efficient Architecture for Nonlinear Function Approximation [0.0]
This paper presents the Chebyshev Kolmogorov-Arnold Network (Chebyshev KAN), a new neural network architecture inspired by the Kolmogorov-Arnold theorem.
By utilizing learnable functions parametrized by Chebyshev polynomials on the network's edges, Chebyshev KANs enhance flexibility, efficiency, and interpretability in function approximation tasks.
arXiv Detail & Related papers (2024-05-12T07:55:43Z)
- Machine Learning Optimized Orthogonal Basis Piecewise Polynomial Approximation [0.9208007322096533]
Piecewise Polynomials (PPs) are utilized in several engineering disciplines, like trajectory planning, to approximate position profiles given in the form of a set of points.
arXiv Detail & Related papers (2024-03-13T14:34:34Z)
- Generalized Activation via Multivariate Projection [46.837481855573145]
Activation functions are essential to introduce nonlinearity into neural networks.
We consider ReLU as a projection from R onto the nonnegative half-line R+.
We extend ReLU by substituting it with a generalized projection operator onto a convex cone, such as the Second-Order Cone (SOC) projection.
arXiv Detail & Related papers (2023-09-29T12:44:27Z)
- Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers [71.32827362323205]
We propose a new class of linear Transformers called Learner-Transformers (Learners).
They incorporate a wide range of relative positional encoding mechanisms (RPEs).
These include regular RPE techniques applied for sequential data, as well as novel RPEs operating on geometric data embedded in higher-dimensional Euclidean spaces.
arXiv Detail & Related papers (2023-02-03T18:57:17Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- Towards Learning Universal Hyperparameter Optimizers with Transformers [57.35920571605559]
We introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction.
Our experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates.
arXiv Detail & Related papers (2022-05-26T12:51:32Z)
- Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution.
We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction.
Our overall method significantly outperforms the state-of-the-art methods by more than 1 dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z)
- Adaptively Customizing Activation Functions for Various Layers [10.522556291990437]
In this work, a novel methodology is proposed to adaptively customize activation functions only by adding very few parameters to the traditional activation functions like Sigmoid, Tanh, and ReLU.
To verify the effectiveness of the proposed methodology, some theoretical and experimental analysis on accelerating the convergence and improving the performance is presented.
The results show that the proposed methodology is simple yet delivers significant gains in convergence speed, precision, and generalization, and that it surpasses popular methods such as ReLU and adaptive functions such as Swish in almost all experiments in terms of overall performance.
arXiv Detail & Related papers (2021-12-17T11:23:03Z)
- Efficient Model-Based Multi-Agent Mean-Field Reinforcement Learning [89.31889875864599]
We propose an efficient model-based reinforcement learning algorithm for learning in multi-agent systems.
Our main theoretical contributions are the first general regret bounds for model-based reinforcement learning for MFC.
We provide a practical parametrization of the core optimization problem.
arXiv Detail & Related papers (2021-07-08T18:01:02Z)
- Composable Learning with Sparse Kernel Representations [110.19179439773578]
We present a reinforcement learning algorithm for learning sparse non-parametric controllers in a Reproducing Kernel Hilbert Space.
We improve the sample complexity of this approach by imposing a structure of the state-action function through a normalized advantage function.
We demonstrate the performance of this algorithm on learning obstacle-avoidance policies in multiple simulations of a robot equipped with a laser scanner while navigating in a 2D environment.
arXiv Detail & Related papers (2021-03-26T13:58:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.