Learning the greatest common divisor: explaining transformer predictions
- URL: http://arxiv.org/abs/2308.15594v2
- Date: Thu, 14 Mar 2024 20:47:17 GMT
- Title: Learning the greatest common divisor: explaining transformer predictions
- Authors: François Charton
- Abstract summary: The predictions of small transformers can be fully characterized by looking at model inputs and outputs.
The model learns a list $\mathcal D$ of integers, products of divisors of the base used to represent integers and small primes, and predicts the largest element of $\mathcal D$ that divides both inputs.
- Score: 8.430481660019451
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The predictions of small transformers, trained to calculate the greatest common divisor (GCD) of two positive integers, can be fully characterized by looking at model inputs and outputs. As training proceeds, the model learns a list $\mathcal D$ of integers, products of divisors of the base used to represent integers and small primes, and predicts the largest element of $\mathcal D$ that divides both inputs. Training distributions impact performance. Models trained from uniform operands only learn a handful of GCDs (up to $38$ GCDs $\leq 100$). Log-uniform operands boost performance to $73$ GCDs $\leq 100$, and a log-uniform distribution of outcomes (i.e. of GCDs) to $91$. However, training from uniform (balanced) GCDs breaks explainability.
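The characterization in the abstract amounts to a small deterministic rule, sketched below for a base-$10$ model. The list D_EXAMPLE is purely illustrative (the actual $\mathcal D$ depends on the base and on how far training has progressed), so this is a hedged reading of the abstract, not the paper's exact model.

```python
from math import gcd

# Sketch of the paper's characterization: the trained model behaves as if it
# had memorized a finite list D of "recognizable" divisors (products of
# divisors of the representation base and a few small primes) and, given
# (a, b), returned the largest d in D dividing both.
# D_EXAMPLE is illustrative only, loosely modeled on base 10 (powers of 2 and
# 5, plus multiples of the small prime 3); it is not taken from the paper.
D_EXAMPLE = sorted({1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 25, 30, 40, 50, 60, 100})

def predicted_gcd(a: int, b: int, D=D_EXAMPLE) -> int:
    """Largest element of D dividing both a and b (the model's effective rule)."""
    return max(d for d in D if a % d == 0 and b % d == 0)

a, b = 84, 210
print(predicted_gcd(a, b), gcd(a, b))  # rule returns 6; true GCD is 42, which is not in D_EXAMPLE
```

When the true GCD is not in $\mathcal D$, the rule returns the largest "recognized" divisor instead, which is the kind of systematic deviation the characterization predicts.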
Related papers
- IT$^3$: Idempotent Test-Time Training [95.78053599609044]
This paper introduces Idempotent Test-Time Training (IT$^3$), a novel approach to addressing the challenge of distribution shift.
IT$3$ is based on the universal property of idempotence.
We demonstrate the versatility of our approach across various tasks, including corrupted image classification.
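As a hedged illustration of the idempotence property the summary refers to (an operator $f$ is idempotent when $f(f(x)) = f(x)$), the sketch below measures how far a map is from acting idempotently on a given input. It is a generic illustration, not IT$^3$'s training or test-time procedure.

```python
import numpy as np

# Hedged illustration: f is idempotent when f(f(x)) = f(x). The gap between one
# and two applications of f can serve as a self-supervised signal at test time.

def idempotence_gap(f, x):
    """Return ||f(f(x)) - f(x)||, zero when f acts idempotently on x."""
    y1 = f(x)
    y2 = f(y1)
    return float(np.linalg.norm(y2 - y1))

project = lambda v: np.clip(v, 0.0, 1.0)                     # projections are idempotent
print(idempotence_gap(project, np.array([-0.5, 0.3, 1.7])))  # 0.0
```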
arXiv Detail & Related papers (2024-10-05T15:39:51Z) - Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia [55.23627698804683]
We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^2$ or $10^3$ system across training data scales.
We identify that base $100$ and base $1000$ systems struggle on token-level discernment and token-level operations.
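A minimal sketch of what the base choice means for tokenization, under the common convention that each base-$b$ digit becomes one token (an assumption here, not a detail taken from the paper): a larger base yields shorter sequences but a larger per-token vocabulary to discriminate over.

```python
# Hedged sketch: tokenize an integer as its digits in a given base.

def to_base_tokens(n: int, base: int) -> list:
    """Digits of a non-negative integer in the given base, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(r)
    return digits[::-1]

print(to_base_tokens(123456, 10))    # [1, 2, 3, 4, 5, 6] -> 6 tokens, vocabulary size 10
print(to_base_tokens(123456, 1000))  # [123, 456]         -> 2 tokens, vocabulary size 1000
```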
arXiv Detail & Related papers (2024-09-25T22:08:31Z) - Models That Prove Their Own Correctness [2.6570606951261015]
We train Self-Proving models that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof.
With high probability over a random input, the model generates a correct output *and* successfully proves its correctness to $V$.
Our learning method is used to train a Self-Proving transformer that computes the GCD *and* proves the correctness of its answer.
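One standard way a GCD answer can be made verifiable is to ship Bezout coefficients along with it: if $g$ divides both inputs and $u a + v b = g$, then $g$ must be the GCD. The sketch below shows that certificate-checking idea only as background; it is not necessarily the verifier $V$ or the proof system used in the paper.

```python
from math import gcd

# Hedged illustration of a verifiable GCD answer via a Bezout certificate.

def verify_gcd(a: int, b: int, g: int, u: int, v: int) -> bool:
    """Accept iff g divides both inputs and u*a + v*b == g (then g == gcd(a, b))."""
    return g > 0 and a % g == 0 and b % g == 0 and u * a + v * b == g

def extended_gcd(a: int, b: int):
    """Return (g, u, v) with u*a + v*b = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, u, v = extended_gcd(b, a % b)
    return g, v, u - (a // b) * v

g, u, v = extended_gcd(240, 46)
assert g == gcd(240, 46) and verify_gcd(240, 46, g, u, v)
```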
arXiv Detail & Related papers (2024-05-24T17:10:08Z) - Low-Complexity Integer Divider Architecture for Homomorphic Encryption [5.857929080874288]
Homomorphic encryption (HE) allows computations to be directly carried out on ciphertexts and enables privacy-preserving cloud computing.
An algorithm is proposed to compute the quotient, and rigorous mathematical proofs are provided.
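For intuition about low-complexity quotient computation, the sketch below implements the textbook restoring (shift-and-subtract) method, which uses only shifts, comparisons, and subtractions. This is a generic illustration, not the divider architecture or algorithm proposed in the paper.

```python
# Hedged illustration: binary restoring division, a classic low-complexity scheme.

def restoring_divide(dividend: int, divisor: int, width: int = 32):
    """Return (quotient, remainder) for non-negative integers via shift-and-subtract."""
    assert divisor > 0 and 0 <= dividend < (1 << width)
    quotient, remainder = 0, 0
    for i in range(width - 1, -1, -1):
        remainder = (remainder << 1) | ((dividend >> i) & 1)  # bring down the next bit
        if remainder >= divisor:                              # subtract when it fits
            remainder -= divisor
            quotient |= 1 << i
    return quotient, remainder

assert restoring_divide(1000, 7) == (142, 6)
```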
arXiv Detail & Related papers (2024-01-19T23:53:59Z) - Length Generalization in Arithmetic Transformers [41.62455986786115]
We show how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training.
We propose train set priming: adding a few ($10$ to $50$) long sequences to the training set.
We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35 \times 3$ examples.
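A minimal sketch of the train-set-priming recipe described above: build a training set of mostly short examples and mix in a small number of long ones. The helper names and sampling details are illustrative assumptions, not the paper's exact setup.

```python
import random

# Hedged sketch of train set priming: mostly short examples plus a few long ones.

def build_primed_trainset(make_short, make_long, n_short=100_000, n_long=50, seed=0):
    """Return a shuffled list of n_short short examples plus n_long priming examples."""
    rng = random.Random(seed)
    data = [make_short(rng) for _ in range(n_short)]
    data += [make_long(rng) for _ in range(n_long)]   # e.g. 10 to 50 priming examples
    rng.shuffle(data)
    return data

# e.g. 5-digit x 3-digit operand pairs as the base task, 35-digit x 3-digit as the primed task
make_short = lambda rng: (rng.randrange(10**4, 10**5), rng.randrange(10**2, 10**3))
make_long = lambda rng: (rng.randrange(10**34, 10**35), rng.randrange(10**2, 10**3))
train_pairs = build_primed_trainset(make_short, make_long)
```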
arXiv Detail & Related papers (2023-06-27T11:53:25Z) - Learning Division with Neural Arithmetic Logic Modules [2.019622939313173]
We show that robustly learning division in a systematic manner remains a challenge even at the simplest level of dividing two numbers.
We propose two novel approaches for division which we call the Neural Reciprocal Unit (NRU) and the Neural Multiplicative Reciprocal Unit (NMRU)
arXiv Detail & Related papers (2021-10-11T11:56:57Z) - Under-bagging Nearest Neighbors for Imbalanced Classification [63.026765294759876]
We propose an ensemble learning algorithm called *under-bagging $k$-NN* for imbalanced classification problems.
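A hedged sketch of the under-bagging idea for $k$-NN: repeatedly subsample every class down to the minority-class size, fit a $k$-NN classifier on each balanced bag, and average the predictions. The parameter values and the scikit-learn-based implementation are illustrative choices, not the paper's algorithm or analysis.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hedged sketch of under-bagging k-NN; X, y, X_test are NumPy arrays.

def under_bagging_knn_predict(X, y, X_test, n_bags=10, k=5, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()                              # minority-class size
    probs = np.zeros((len(X_test), len(classes)))
    for _ in range(n_bags):
        # draw a balanced bag: n_min samples per class, without replacement
        idx = np.concatenate([rng.choice(np.flatnonzero(y == c), n_min, replace=False)
                              for c in classes])
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
        probs += clf.predict_proba(X_test)            # accumulate votes across bags
    return classes[np.argmax(probs, axis=1)]
```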
arXiv Detail & Related papers (2021-09-01T14:10:38Z) - Learning elliptic partial differential equations with randomized linear algebra [2.538209532048867]
We show that one can construct an approximant to $G$ that converges almost surely.
The quantity $0 < \Gamma_\epsilon \leq 1$ characterizes the quality of the training dataset.
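The randomized-linear-algebra primitive behind such constructions can be sketched generically: probe the operator with random vectors, orthonormalize the responses, and project onto the resulting basis. The snippet below is the standard randomized range finder applied to a plain matrix, shown only as background; it is not the paper's Green's-function construction or its probabilistic analysis.

```python
import numpy as np

# Hedged sketch: generic randomized range finder, A ~= Q @ (Q.T @ A).

def randomized_approximant(apply_A, n, rank, oversample=10, seed=0):
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((n, rank + oversample))  # random probe vectors
    Y = apply_A(Omega)                                   # operator responses
    Q, _ = np.linalg.qr(Y)                               # orthonormal basis for the range
    return Q, Q.T @ apply_A(np.eye(n))                   # low-rank factors of A

A = np.random.default_rng(1).standard_normal((200, 50)) @ \
    np.random.default_rng(2).standard_normal((50, 200))  # rank-50 test matrix
Q, B = randomized_approximant(lambda X: A @ X, 200, rank=50)
print(np.linalg.norm(A - Q @ B) / np.linalg.norm(A))      # ~0 up to rounding error
```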
arXiv Detail & Related papers (2021-01-31T16:57:59Z) - Improving Robustness and Generality of NLP Models Using Disentangled
Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
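For reference, the standard pipeline the summary describes (an encoder mapping the input $x$ to a single representation $z$, followed by a head mapping $z$ to the label $y$) looks like the minimal PyTorch sketch below; the paper's disentanglement criteria are deliberately not reproduced here.

```python
import torch.nn as nn

# Hedged sketch of the standard x -> z -> y supervised pipeline (toy text model).

class Encoder(nn.Module):
    def __init__(self, vocab_size=10_000, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # averages token embeddings
    def forward(self, token_ids):
        return self.embed(token_ids)                   # x -> z

class Classifier(nn.Module):
    def __init__(self, dim=128, n_labels=2):
        super().__init__()
        self.encoder = Encoder(dim=dim)
        self.head = nn.Linear(dim, n_labels)
    def forward(self, token_ids):
        z = self.encoder(token_ids)                    # single representation z
        return self.head(z)                            # z -> y
```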
arXiv Detail & Related papers (2020-09-21T02:48:46Z) - On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal F \circ \mathcal H$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal H) + t\, C(\mathcal F)$.
arXiv Detail & Related papers (2020-06-20T20:33:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.