Teaching Transformers Modular Arithmetic at Scale
- URL: http://arxiv.org/abs/2410.03569v1
- Date: Fri, 4 Oct 2024 16:19:33 GMT
- Title: Teaching Transformers Modular Arithmetic at Scale
- Authors: Eshika Saxena, Alberto Alfarano, Emily Wenger, Kristin Lauter
- Abstract summary: This work proposes three changes to the modular addition model training pipeline.
We demonstrate success with our approach for $N = 256, q = 3329$, a case which is interesting for cryptographic applications.
- Score: 9.68892691572611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modular addition is, on its face, a simple operation: given $N$ elements in $\mathbb{Z}_q$, compute their sum modulo $q$. Yet, scalable machine learning solutions to this problem remain elusive: prior work trains ML models that sum $N \le 6$ elements mod $q \le 1000$. Promising applications of ML models for cryptanalysis-which often involve modular arithmetic with large $N$ and $q$-motivate reconsideration of this problem. This work proposes three changes to the modular addition model training pipeline: more diverse training data, an angular embedding, and a custom loss function. With these changes, we demonstrate success with our approach for $N = 256, q = 3329$, a case which is interesting for cryptographic applications, and a significant increase in $N$ and $q$ over prior work. These techniques also generalize to other modular arithmetic problems, motivating future work.
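As a concrete illustration of the second and third changes, the sketch below shows one plausible form of an angular embedding (each residue mapped to a point on the unit circle) together with an angle-aware loss. The function names, loss form, and usage here are illustrative assumptions; the paper's exact architecture and loss may differ.

```python
import numpy as np

Q = 3329  # modulus from the paper's largest reported setting
N = 256   # number of summands

def angular_embedding(x, q=Q):
    """Map residues x in Z_q onto the unit circle: x -> (cos, sin) of 2*pi*x/q."""
    theta = 2 * np.pi * (np.asarray(x) % q) / q
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

def angular_loss(pred_angle, target_residue, q=Q):
    """One plausible angle-aware loss: squared circular distance between the
    predicted angle and the angle of the true residue (illustrative only)."""
    target_angle = 2 * np.pi * (target_residue % q) / q
    wrapped = np.angle(np.exp(1j * (pred_angle - target_angle)))  # in (-pi, pi]
    return np.mean(wrapped ** 2)

# Toy usage: embed one random instance and check the loss at the true answer.
rng = np.random.default_rng(0)
xs = rng.integers(0, Q, size=N)
emb = angular_embedding(xs)          # shape (N, 2), fed to the sequence model
true_sum = int(xs.sum()) % Q
print(emb.shape, angular_loss(2 * np.pi * true_sum / Q, true_sum))  # loss ~ 0
```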
Related papers
- Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia [55.23627698804683]
We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scale.
We identify that base $100$ and base $1000$ systems struggle on token-level discernment and token-level operations.
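To make the token-count versus vocabulary-size trade-off concrete, here is a minimal base-$b$ tokenization sketch (illustrative only; the tokenizers evaluated in the paper may differ):

```python
def to_base_tokens(n, base):
    """Represent a non-negative integer as digit tokens in the given base,
    most significant token first."""
    if n == 0:
        return [0]
    tokens = []
    while n > 0:
        tokens.append(n % base)
        n //= base
    return tokens[::-1]

n = 987654
print(to_base_tokens(n, 10))    # [9, 8, 7, 6, 5, 4] -> 6 tokens, 10-symbol vocabulary
print(to_base_tokens(n, 1000))  # [987, 654]         -> 2 tokens, 1000-symbol vocabulary
```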
arXiv Detail & Related papers (2024-09-25T22:08:31Z)
- In-Context Learning Creates Task Vectors [40.990432572831885]
In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm.
Here we show that the functions learned by ICL often have a very simple structure.
We support the above claim via comprehensive experiments across a range of models and tasks.
arXiv Detail & Related papers (2023-10-24T15:17:14Z)
- Length Generalization in Arithmetic Transformers [41.62455986786115]
We show how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training.
We propose train set priming: adding a few ($10$ to $50$) long sequences to the training set.
We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35 \times 3$ examples.
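A minimal sketch of what this kind of data mixing could look like; the problem format, counts, and sampling below are illustrative assumptions rather than the paper's exact setup.

```python
import random

def mult_example(digits_a, digits_b, rng):
    """Sample one multiplication problem as an (input, target) string pair."""
    a = rng.randrange(10 ** (digits_a - 1), 10 ** digits_a)
    b = rng.randrange(10 ** (digits_b - 1), 10 ** digits_b)
    return f"{a}*{b}=", str(a * b)

rng = random.Random(0)
# Bulk of the training set: short 5-digit x 3-digit problems.
train = [mult_example(5, 3, rng) for _ in range(10_000)]
# "Priming": mix in a handful (here 50) of long 35-digit x 3-digit problems.
train += [mult_example(35, 3, rng) for _ in range(50)]
rng.shuffle(train)
print(len(train), train[0])
```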
arXiv Detail & Related papers (2023-06-27T11:53:25Z)
- Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
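The shortcut rests on associativity: processing a length-$T$ input amounts to composing $T$ per-symbol state-transition maps, and an associative product can be evaluated in $O(\log T)$ parallel rounds rather than $T$ sequential steps. Below is a minimal Python sketch of that idea (not the paper's Transformer construction; the parity automaton is just a toy example).

```python
import numpy as np

def compose(f, g):
    """Compose two state-transition maps (f then g), each an array state -> state."""
    return g[f]

def run_log_depth(transitions):
    """Combine T transition maps in a balanced tree: O(log T) sequential rounds,
    mirroring how a shallow 'shortcut' circuit can simulate the automaton."""
    maps = list(transitions)
    while len(maps) > 1:
        paired = []
        for i in range(0, len(maps) - 1, 2):
            paired.append(compose(maps[i], maps[i + 1]))
        if len(maps) % 2:            # odd element carries over to the next round
            paired.append(maps[-1])
        maps = paired                # one "round" halves the sequence length
    return maps[0]

# Toy automaton: parity of 1s over alphabet {0, 1}, states {0, 1}.
delta = {0: np.array([0, 1]), 1: np.array([1, 0])}   # symbol -> transition map
rng = np.random.default_rng(0)
word = rng.integers(0, 2, size=1000)
overall = run_log_depth([delta[int(s)] for s in word])
assert overall[0] == word.sum() % 2  # final state from start state 0 = parity
print("final state:", overall[0])
```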
arXiv Detail & Related papers (2022-10-19T17:45:48Z)
- On the Provable Generalization of Recurrent Neural Networks [7.115768009778412]
We analyze the training and generalization of Recurrent Neural Networks (RNNs).
We prove a generalization error bound for learning functions without normalization conditions.
We also prove a novel result for learning $N$-variable functions of the input sequence.
arXiv Detail & Related papers (2021-09-29T02:06:33Z)
- Halving the width of Toffoli based constant modular addition to n+3 qubits [69.43216268165402]
We present an arithmetic circuit performing constant modular addition having $\mathcal{O}(n)$ depth of Toffoli gates.
This is an improvement by a factor of two compared to the width of the state-of-the-art Toffoli-based constant modular adder.
arXiv Detail & Related papers (2021-02-06T17:07:48Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)
- On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t\,C(\mathcal{F})$.
arXiv Detail & Related papers (2020-06-20T20:33:59Z)
- On the Modularity of Hypernetworks [103.1147622394852]
We show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.
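For context, a hypernetwork produces the weights of a target network from an embedding instead of storing them directly. The sketch below (hypothetical class and dimensions) shows only this mechanism, not the paper's parameter-count analysis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperNetwork(nn.Module):
    """Generate the weights of a target linear layer f from a task embedding,
    rather than storing f's weights as trainable parameters."""
    def __init__(self, emb_dim, target_in, target_out):
        super().__init__()
        self.target_in, self.target_out = target_in, target_out
        self.gen = nn.Linear(emb_dim, target_in * target_out + target_out)

    def forward(self, emb, x):
        params = self.gen(emb)                 # emit f's parameters
        split = self.target_in * self.target_out
        w = params[:split].view(self.target_out, self.target_in)
        b = params[split:]
        return F.linear(x, w, b)               # apply the generated layer f

hyper = HyperNetwork(emb_dim=8, target_in=64, target_out=32)
x, task_emb = torch.randn(4, 64), torch.randn(8)
print(hyper(task_emb, x).shape)  # torch.Size([4, 32])
```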
arXiv Detail & Related papers (2020-02-23T22:51:52Z)