Pre-trained Large Language Models Use Fourier Features to Compute Addition
- URL: http://arxiv.org/abs/2406.03445v1
- Date: Wed, 5 Jun 2024 16:40:53 GMT
- Title: Pre-trained Large Language Models Use Fourier Features to Compute Addition
- Authors: Tianyi Zhou, Deqing Fu, Vatsal Sharan, Robin Jia
- Abstract summary: Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities.
How they compute basic arithmetic, such as addition, remains unclear.
- Score: 37.56242478466735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained large language models (LLMs) exhibit impressive mathematical reasoning capabilities, yet how they compute basic arithmetic, such as addition, remains unclear. This paper shows that pre-trained LLMs add numbers using Fourier features -- dimensions in the hidden state that represent numbers via a set of features sparse in the frequency domain. Within the model, MLP and attention layers use Fourier features in complementary ways: MLP layers primarily approximate the magnitude of the answer using low-frequency features, while attention layers primarily perform modular addition (e.g., computing whether the answer is even or odd) using high-frequency features. Pre-training is crucial for this mechanism: models trained from scratch to add numbers only exploit low-frequency features, leading to lower accuracy. Introducing pre-trained token embeddings to a randomly initialized model rescues its performance. Overall, our analysis demonstrates that appropriate pre-trained representations (e.g., Fourier features) can unlock the ability of Transformers to learn precise mechanisms for algorithmic tasks.
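The mechanism described in the abstract lends itself to a small numerical illustration. The snippet below is not the authors' implementation; it is a toy sketch, under assumed periods and noise, of how a rough magnitude estimate (low-frequency information, attributed to MLP layers) combined with modular constraints (high-frequency features, attributed to attention layers) pins down an exact sum.
```python
import numpy as np

# Toy sketch of the division of labor described above. The periods, the noise
# scale, and decoding by candidate search are illustrative assumptions,
# not the paper's implementation.

PERIODS = [2, 5, 10, 100]  # "answer mod T" constraints carried by high frequencies

def fourier_features(n: int) -> dict:
    """Encode an integer as one phase per period (sparse in the frequency domain)."""
    return {T: 2 * np.pi * n / T for T in PERIODS}

def add_phases(fa: dict, fb: dict) -> dict:
    """Composing the phases of a and b gives the phases of a + b."""
    return {T: fa[T] + fb[T] for T in PERIODS}

def decode(rough_sum: float, phases: dict) -> int:
    """Return the integer near the rough magnitude whose residues match every period."""
    candidates = range(int(rough_sum) - 60, int(rough_sum) + 61)

    def mismatch(c: int) -> float:
        return sum(1 - np.cos(2 * np.pi * c / T - phases[T]) for T in PERIODS)

    return min(candidates, key=mismatch)

a, b = 137, 488
rough = a + b + np.random.default_rng(0).normal(scale=20.0)  # magnitude is right, digits are not
exact = decode(rough, add_phases(fourier_features(a), fourier_features(b)))
print(round(rough, 1), exact)  # rough estimate near 625, exact answer 625
```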
Related papers
- Leveraging FourierKAN Classification Head for Pre-Trained Transformer-based Text Classification [0.51795041186793]
We introduce FR-KAN, a variant of Kolmogorov-Arnold Networks (KANs), a promising alternative architecture, as a classification head for transformer-based encoders.
Our studies reveal an average increase of 10% in accuracy and 11% in F1-score when incorporating FR-KAN heads instead of traditional heads in transformer-based pre-trained models.
arXiv Detail & Related papers (2024-08-16T15:28:02Z) - Emergence in non-neural models: grokking modular arithmetic via average gradient outer product [16.911836722312152]
We show that grokking is specific neither to neural networks nor to gradient descent-based optimization.
We show that this phenomenon occurs when learning modular arithmetic with Recursive Feature Machines.
Our results demonstrate that emergence can result purely from learning task-relevant features.
arXiv Detail & Related papers (2024-07-29T17:28:58Z) - Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z) - On Sequential Loss Approximation for Continual Learning [0.0]
For continual learning, we introduce Autodiff Quadratic Consolidation (AQC) and Neural Consolidation (NC).
AQC approximates the previous loss function with a quadratic function, and NC approximates the previous loss function with a neural network.
We empirically study these methods in class-incremental learning, for which regularization-based methods produce unsatisfactory results.
arXiv Detail & Related papers (2024-05-26T09:20:47Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Basis Function Encoding of Numerical Features in Factorization Machines
for Improved Accuracy [2.3022070933226217]
We provide a systematic and theoretically-justified way to incorporate numerical features into FM variants.
We show that our technique yields a model that learns segmentized functions of the numerical feature spanned by the set of functions of one's choice.
Our technique preserves fast training and inference, and requires only a small modification of the computational graph of an FM model.
arXiv Detail & Related papers (2023-05-23T21:10:17Z) - Transformers Can Do Bayesian Inference [56.99390658880008]
We present Prior-Data Fitted Networks (PFNs).
PFNs leverage in-context learning in large-scale machine learning techniques to approximate a large set of posteriors.
We demonstrate that PFNs can near-perfectly mimic Gaussian processes and also enable efficient Bayesian inference for intractable problems.
arXiv Detail & Related papers (2021-12-20T13:07:39Z) - Learning Set Functions that are Sparse in Non-Orthogonal Fourier Bases [73.53227696624306]
We present a new family of algorithms for learning Fourier-sparse set functions.
In contrast to other work that focused on the Walsh-Hadamard transform, our novel algorithms operate with recently introduced non-orthogonal Fourier transforms.
We demonstrate effectiveness on several real-world applications.
arXiv Detail & Related papers (2020-10-01T14:31:59Z) - Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains [69.62456877209304]
We show that passing input points through a simple Fourier feature mapping enables a multilayer perceptron to learn high-frequency functions (a minimal sketch of this mapping appears after this list).
These results shed light on recent advances in computer vision and graphics that achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-06-18T17:59:11Z)
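As a minimal sketch of the Fourier feature mapping from the last entry above: the input coordinates are projected by a random Gaussian matrix B and passed through sine and cosine before reaching the MLP. The dimensions and frequency scale below are arbitrary choices for illustration, not values from that paper.
```python
import numpy as np

def fourier_feature_map(v: np.ndarray, B: np.ndarray) -> np.ndarray:
    """gamma(v) = [cos(2*pi*Bv), sin(2*pi*Bv)]: lifts low-dimensional coordinates
    into a space where a plain MLP can fit high-frequency detail."""
    proj = 2.0 * np.pi * v @ B.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(42)
B = rng.normal(scale=10.0, size=(256, 2))  # scale sets the bandwidth of the mapping
coords = rng.uniform(size=(1024, 2))       # e.g. a batch of (x, y) coordinates in [0, 1]^2
features = fourier_feature_map(coords, B)  # shape (1024, 512), then fed to the MLP
```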
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.