Pre-trained Large Language Models Use Fourier Features to Compute Addition
- URL: http://arxiv.org/abs/2406.03445v1
- Date: Wed, 5 Jun 2024 16:40:53 GMT
- Title: Pre-trained Large Language Models Use Fourier Features to Compute Addition
- Authors: Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia
- Abstract summary: Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities.
How they compute basic arithmetic, such as addition, remains unclear.
- Score: 37.56242478466735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
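The mechanism described in the abstract lends itself to a small numerical illustration. The snippet below is not the authors' implementation; it is a toy sketch, under assumed periods and noise, of how a rough magnitude estimate (low-frequency information, attributed to MLP layers) combined with modular constraints (high-frequency features, attributed to attention layers) pins down an exact sum.
```python
import numpy as np

# Toy sketch of the division of labor described above. The periods, the noise
# scale, and decoding by candidate search are illustrative assumptions,
# not the paper's implementation.

PERIODS = [2, 5, 10, 100]  # "answer mod T" constraints carried by high frequencies

def fourier_features(n: int) -> dict:
    """Encode an integer as one phase per period (sparse in the frequency domain)."""
    return {T: 2 * np.pi * n / T for T in PERIODS}

def add_phases(fa: dict, fb: dict) -> dict:
    """Composing the phases of a and b gives the phases of a + b."""
    return {T: fa[T] + fb[T] for T in PERIODS}

def decode(rough_sum: float, phases: dict) -> int:
    """Return the integer near the rough magnitude whose residues match every period."""
    candidates = range(int(rough_sum) - 60, int(rough_sum) + 61)

    def mismatch(c: int) -> float:
        return sum(1 - np.cos(2 * np.pi * c / T - phases[T]) for T in PERIODS)

    return min(candidates, key=mismatch)

a, b = 137, 488
rough = a + b + np.random.default_rng(0).normal(scale=20.0)  # magnitude is right, digits are not
exact = decode(rough, add_phases(fourier_features(a), fourier_features(b)))
print(round(rough, 1), exact)  # rough estimate near 625, exact answer 625
```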
Related papers
- Leveraging FourierKAN Classification Head for Pre-Trained Transformer-based Text Classification [0.51795041186793]
We introduce FR-KAN, a variant of Kolmogorov-Arnold Networks (KANs), a promising alternative architecture, as a classification head for transformer-based encoders.
Our studies reveal an average increase of 10% in accuracy and 11% in F1-score when incorporating FR-KAN heads instead of traditional heads in transformer-based pre-trained models.
arXiv Detail & Related papers (2024-08-16T15:28:02Z) - Emergence in non-neural models: grokking modular arithmetic via average gradient outer product [16.911836722312152]
We show that grokking is specific neither to neural networks nor to gradient descent-based optimization.
We show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines.
Our results demonstrate that emergence can result purely from learning task-relevant features.
arXiv Detail & Related papers (2024-07-29T17:28:58Z) - Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z) - On Sequential Loss Approximation for Continual Learning [0.0]
For continual learning, we introduce Autodiff Quadratic Consolidation (AQC) and Neural Consolidation (NC).
AQC approximates the previous loss function with a quadratic function, and NC approximates the previous loss function with a neural network.
We empirically study these methods in class-incremental learning, for which regularization-based methods produce unsatisfactory results.
arXiv Detail & Related papers (2024-05-26T09:20:47Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Basis Function Encoding of Numerical Features in Factorization Machines
for Improved Accuracy [2.3022070933226217]
We provide a systematic and theoretically-justified way to incorporate numerical features into FM variants.
We show that our technique yields a model that learns segmentized functions of the numerical feature spanned by the set of functions of one's choice.
Our technique preserves fast training and inference, and requires only a small modification of the computational graph of an FM model.
arXiv Detail & Related papers (2023-05-23T21:10:17Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs).
PFNs leverage in-context learning in large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - Learning Set Functions that are Sparse in Non-Orthogonal Fourier Bases [73.53227696624306]
We present a new family of algorithms for learning Fourier-sparse set functions.
In contrast to other work that focused on the Walsh-Hadamard transform, our novel algorithms operate with recently introduced non-orthogonal Fourier transforms.
We demonstrate effectiveness on several real-world applications.
arXiv Detail & Related papers (2020-10-01T14:31:59Z) - Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains [69.62456877209304]
We show that passing input points through a simple Fourier feature mapping enables a multilayer perceptron to learn high-frequency functions (a minimal sketch of this mapping appears after this list).
These results shed light on recent advances in computer vision and graphics that achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-06-18T17:59:11Z)
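As a minimal sketch of the Fourier feature mapping from the last entry above: the input coordinates are projected by a random Gaussian matrix B and passed through sine and cosine before reaching the MLP. The dimensions and frequency scale below are arbitrary choices for illustration, not values from that paper.
```python
import numpy as np

def fourier_feature_map(v: np.ndarray, B: np.ndarray) -> np.ndarray:
    """gamma(v) = [cos(2*pi*Bv), sin(2*pi*Bv)]: lifts low-dimensional coordinates
    into a space where a plain MLP can fit high-frequency detail."""
    proj = 2.0 * np.pi * v @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(42)
B = rng.normal(scale=10.0, size=(256, 2))  # scale sets the bandwidth of the mapping
coords = rng.uniform(size=(1024, 2))       # e.g. a batch of (x, y) coordinates in [0, 1]^2
features = fourier_feature_map(coords, B)  # shape (1024, 512), then fed to the MLP
```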
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.