Acceleration of Grokking in Learning Arithmetic Operations via Kolmogorov-Arnold Representation
- URL: http://arxiv.org/abs/2405.16658v1
- Date: Sun, 26 May 2024 18:29:24 GMT
- Title: Acceleration of Grokking in Learning Arithmetic Operations via Kolmogorov-Arnold Representation
- Authors: Yeachan Park, Minseok Kim, Yeoneung Kim
- Abstract summary: We focus on the grokking phenomenon that arises in learning arithmetic binary operations via the transformer model.
We suggest various transfer learning mechanisms that expedite grokking.
- Score: 3.7812707887425048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose novel methodologies aimed at accelerating the grokking phenomenon, which refers to the rapid increment of test accuracy after a long period of overfitting as reported in~\cite{power2022grokking}. Focusing on the grokking phenomenon that arises in learning arithmetic binary operations via the transformer model, we begin with a discussion on data augmentation in the case of commutative binary operations. To further accelerate, we elucidate arithmetic operations through the lens of the Kolmogorov-Arnold (KA) representation theorem, revealing its correspondence to the transformer architecture: embedding, decoder block, and classifier. Observing the shared structure between KA representations associated with binary operations, we suggest various transfer learning mechanisms that expedite grokking. This interpretation is substantiated through a series of rigorous experiments. In addition, our approach is successful in learning two nonstandard arithmetic tasks: composition of operations and a system of equations. Furthermore, we reveal that the model is capable of learning arithmetic operations using a limited number of tokens under embedding transfer, which is supported by a set of experiments as well.
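For reference, the Kolmogorov-Arnold representation theorem invoked above states that any continuous function of $n$ variables on a compact domain can be written as a superposition of continuous univariate functions:

```latex
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

The data augmentation for commutative operations is also simple to sketch. In the standard grokking setup of~\cite{power2022grokking}, a model learns a binary operation from triples $(a, b, a \circ b)$; when the operation commutes, every training triple licenses the swapped triple as well. The helper names and the modular-addition task below are illustrative assumptions, not the paper's code:

```python
# Minimal sketch of commutativity-based data augmentation for a grokking-style
# modular addition task. Function names and the train split are illustrative.

def make_pairs(p):
    """All labelled examples ((a, b), c) with c = (a + b) mod p."""
    return [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

def augment_commutative(examples):
    """Since a + b = b + a, every ((a, b), c) also licenses ((b, a), c)."""
    table = dict(examples)
    for (a, b), c in examples:
        table[(b, a)] = c
    return sorted(table.items())

train = make_pairs(97)[::3]  # a sparse train split, as in grokking setups
print(len(train), len(augment_commutative(train)))
```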
Related papers
- EulerFormer: Sequential User Behavior Modeling with Complex Vector Attention [88.45459681677369]
We propose a novel transformer variant with complex vector attention, named EulerFormer.
It provides a unified theoretical framework to formulate both semantic difference and positional difference.
It is more robust to semantic variations and possesses superior theoretical properties in principle.
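A minimal sketch of the complex-vector attention idea behind EulerFormer: queries and keys are lifted to complex vectors, and position enters as a phase factor through Euler's formula, so the attention logit depends on the positional difference via a rotation. The specific rotation parameterization below is an illustrative assumption, not EulerFormer's exact design:

```python
import torch

def complex_rotation_scores(q, k, positions, theta=0.05):
    """Attention logits where position enters as a phase via Euler's formula.

    q, k: (seq, dim) real tensors with even dim (halves form real/imag parts).
    positions: (seq,) integer positions. Illustrative, not EulerFormer's design.
    """
    half = q.shape[-1] // 2
    qc = torch.complex(q[..., :half], q[..., half:])             # lift to complex
    kc = torch.complex(k[..., :half], k[..., half:])
    phase = torch.exp(1j * theta * positions.to(torch.float32))  # e^{i*theta*pos}
    qc, kc = qc * phase[:, None], kc * phase[:, None]
    # (q e^{i*theta*m}) . conj(k e^{i*theta*n}) rotates by theta*(m - n),
    # so the logit depends only on the positional difference.
    return (qc @ kc.conj().t()).real

q, k = torch.randn(4, 8), torch.randn(4, 8)
print(complex_rotation_scores(q, k, torch.arange(4)).shape)  # torch.Size([4, 4])
```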
arXiv Detail & Related papers (2024-03-26T14:18:43Z) - Bit Cipher -- A Simple yet Powerful Word Representation System that Integrates Efficiently with Language Models [4.807347156077897]
Bit-cipher is a word representation system that eliminates the need for backpropagation and hyper-efficient dimensionality reduction techniques.
We perform probing experiments on part-of-speech (POS) tagging and named entity recognition (NER) to assess bit-cipher's competitiveness with classic embeddings.
By replacing embedding layers with cipher embeddings, our experiments illustrate the notable efficiency of cipher in accelerating the training process and attaining better optima.
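The embedding-replacement experiment can be pictured with a hedged PyTorch sketch: a trainable embedding layer is swapped for a frozen one built from precomputed cipher vectors, so no gradients flow into the representation system. The cipher_vectors placeholder stands in for whatever the bit-cipher procedure produces; it is an assumption, not the paper's API:

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 64

# Placeholder for precomputed bit-cipher vectors (assumed shape: vocab x dim);
# random values stand in for the real, backpropagation-free construction.
cipher_vectors = torch.randn(vocab_size, dim)

# A frozen embedding layer built from the precomputed vectors: the language
# model trains on top of it, but no gradients reach the word representations.
embedding = nn.Embedding.from_pretrained(cipher_vectors, freeze=True)

tokens = torch.tensor([[1, 5, 42]])
print(embedding(tokens).shape)  # torch.Size([1, 3, 64])
```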
arXiv Detail & Related papers (2023-11-18T08:47:35Z) - Towards Understanding How Transformers Learn In-context Through a Representation Learning Lens [9.590540796223715]
In this paper, we attempt to explore the in-context learning process in Transformers through a lens of representation learning.
The ICL inference process of the attention layer aligns with the training procedure of its dual model, which generates predictions of token representations.
We extend our theoretical conclusions to more complicated scenarios, including one Transformer layer and multiple attention layers.
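The duality can be made concrete in its simplest, widely cited linear-attention form (the paper's construction may differ): the output of linear attention over in-context pairs equals the prediction of a linear model after one gradient descent step on those pairs, starting from zero weights:

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(8, 4))   # in-context keys k_i
V = rng.normal(size=(8, 3))   # in-context values v_i
q = rng.normal(size=4)        # query token
eta = 0.1                     # learning rate of the dual model

# Linear attention: eta * sum_i v_i (k_i . q)
attn_out = eta * V.T @ (K @ q)

# Dual model: one gradient step on L(W) = 0.5 * sum_i ||v_i - W k_i||^2
# from W = 0 gives W = eta * sum_i v_i k_i^T; its prediction matches attention.
W = eta * V.T @ K
print(np.allclose(attn_out, W @ q))  # True
```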
arXiv Detail & Related papers (2023-10-20T01:55:34Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - Homomorphism Autoencoder -- Learning Group Structured Representations from Observed Transitions [51.71245032890532]
We propose methods enabling an agent acting upon the world to learn internal representations of sensory information consistent with actions that modify it.
In contrast to existing work, our approach does not require prior knowledge of the group and does not restrict the set of actions the agent can perform.
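The core mechanism, actions acting on a latent space through a learned group representation, can be sketched as follows. Mapping each action to a matrix exponential of learned generators is one standard way to get composable, group-like latent transformations; it is an illustrative assumption here, not the paper's implementation:

```python
import torch
import torch.nn as nn

class GroupAction(nn.Module):
    """Map an action a to a matrix rho(a) = exp(sum_j a_j G_j) acting on the
    latent space. With commuting generators, rho(a) rho(b) = rho(a + b), i.e.
    a homomorphism. A sketch, not the paper's exact model."""

    def __init__(self, action_dim, latent_dim):
        super().__init__()
        self.generators = nn.Parameter(
            0.01 * torch.randn(action_dim, latent_dim, latent_dim)
        )

    def forward(self, z, action):
        A = torch.einsum("j,jkl->kl", action, self.generators)  # sum_j a_j G_j
        return torch.matrix_exp(A) @ z                          # apply rho(a)

act = GroupAction(action_dim=2, latent_dim=4)
z = torch.randn(4)
print(act(z, torch.tensor([1.0, 0.0])).shape)  # torch.Size([4])
```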
arXiv Detail & Related papers (2022-07-25T11:22:48Z) - Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
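The general recipe, a primary sequence-prediction loss plus auxiliary heads supervising function and argument semantics, can be sketched as a weighted multi-task loss; the weighting and head shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def multitask_loss(primary_logits, primary_tgt,
                   func_logits, func_tgt,
                   arg_logits, arg_tgt,
                   aux_weight=0.5):
    """Primary loss plus two auxiliary sequence-prediction losses that supervise
    intermediate function/argument semantics. Weights are illustrative."""
    return (F.cross_entropy(primary_logits, primary_tgt)
            + aux_weight * F.cross_entropy(func_logits, func_tgt)
            + aux_weight * F.cross_entropy(arg_logits, arg_tgt))

logits = torch.randn(8, 10)           # (tokens, classes), flattened sequence
targets = torch.randint(0, 10, (8,))
print(multitask_loss(logits, targets, logits, targets, logits, targets).item())
```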
arXiv Detail & Related papers (2021-09-30T16:41:19Z) - Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z) - Abelian Neural Networks [48.52497085313911]
We first construct a neural network architecture for Abelian group operations and derive a universal approximation property.
We extend it to Abelian semigroup operations using the characterization of associative symmetric polynomials.
We train our models over fixed word embeddings and demonstrate improved performance over the original word2vec.
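The generic construction for such architectures models the binary operation as f(x, y) = psi(phi(x) + phi(y)); since the inner sum is symmetric, commutativity holds by construction. The module shapes below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AbelianOp(nn.Module):
    """Binary operation f(x, y) = psi(phi(x) + phi(y)); swapping the operands
    cannot change the output, so commutativity holds by construction.
    A generic sketch, not the paper's exact architecture."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.phi = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.psi = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x, y):
        return self.psi(self.phi(x) + self.phi(y))

op = AbelianOp(dim=8)
x, y = torch.randn(8), torch.randn(8)
print(torch.allclose(op(x, y), op(y, x)))  # True: commutative by design
```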
arXiv Detail & Related papers (2021-02-24T11:52:21Z) - Improving Transformation Invariance in Contrastive Representation Learning [31.223892428863238]
First, we introduce a training objective for contrastive learning that uses a novel regularizer to control how the representation changes under transformation.
Second, we propose a change to how test time representations are generated by introducing a feature averaging approach that combines encodings from multiple transformations of the original input.
Third, we introduce the novel Spirograph dataset to explore our ideas in the context of a differentiable generative process with multiple downstream tasks.
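The test-time feature-averaging step is straightforward to sketch: encode several transformed views of the input and average the encodings. The encoder and transformation family below are placeholders:

```python
import torch

def averaged_representation(encoder, x, transforms):
    """Test-time representation: mean encoding over several transformed views.
    `encoder` and `transforms` are placeholders for the trained model and the
    augmentation family used during contrastive training."""
    with torch.no_grad():
        feats = torch.stack([encoder(t(x)) for t in transforms])
    return feats.mean(dim=0)

# Toy usage with an identity "encoder" and simple flips as transforms.
enc = lambda v: v
views = [lambda v: v, lambda v: v.flip(-1)]
print(averaged_representation(enc, torch.arange(4.0), views))
```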
arXiv Detail & Related papers (2020-10-19T13:49:29Z) - Lattice Representation Learning [6.427169570069738]
We introduce theory and algorithms for learning discrete representations that take values on a lattice embedded in a Euclidean space.
Lattice representations possess an interesting combination of properties: a) they can be computed explicitly using lattice quantization, yet they can be learned efficiently using the ideas we introduce.
This article will focus on laying the groundwork for exploring and exploiting the first two properties, including a new mathematical result linking expressions used during training and inference time and experimental validation on two popular datasets.
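The "computed explicitly using lattice quantization" property can be illustrated with the simplest case, a scaled integer lattice, where quantization is elementwise rounding; the lattice choice and the straight-through gradient trick are illustrative assumptions:

```python
import torch

def lattice_quantize(z, scale=0.5):
    """Nearest point of the scaled integer lattice (scale * Z^n), with a
    straight-through estimator so gradients pass through the rounding.
    Illustrative: the article considers more general embedded lattices."""
    q = scale * torch.round(z / scale)
    return z + (q - z).detach()  # forward: q; backward: identity

z = torch.tensor([0.23, -1.12, 0.74], requires_grad=True)
print(lattice_quantize(z))  # tensor([ 0.0000, -1.0000,  0.5000], ...)
```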
arXiv Detail & Related papers (2020-06-24T16:05:11Z)