Transformers know more than they can tell -- Learning the Collatz sequence
- URL: http://arxiv.org/abs/2511.10811v1
- Date: Thu, 13 Nov 2025 21:23:24 GMT
- Title: Transformers know more than they can tell -- Learning the Collatz sequence
- Authors: François Charton, Ashvni Narayanan
- Abstract summary: We investigate transformer prediction of long Collatz steps: a complex arithmetic function that maps odd integers to their distant successors. All models, no matter the base, follow a common learning pattern.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate transformer prediction of long Collatz steps, a complex arithmetic function that maps odd integers to their distant successors in the Collatz sequence ( $u_{n+1}=u_n/2$ if $u_n$ is even, $u_{n+1}=(3u_n+1)/2$ if $u_n$ is odd). Model accuracy varies with the base used to encode input and output. It can be as high as $99.7\%$ for bases $24$ and $32$, and as low as $37\%$ and $25\%$ for bases $11$ and $3$. Yet, all models, no matter the base, follow a common learning pattern. As training proceeds, they learn a sequence of classes of inputs that share the same residue modulo $2^p$. Models achieve near-perfect accuracy on these classes, and less than $1\%$ for all other inputs. This maps to a mathematical property of Collatz sequences: the length of the loops involved in the computation of a long Collatz step can be deduced from the binary representation of its input. The learning pattern reflects the model learning to predict inputs associated with increasing loop lengths. An analysis of failure cases reveals that almost all model errors follow predictable patterns. Hallucination, a common feature of large language models, almost never happens. In over $90\%$ of failures, the model performs the correct calculation, but wrongly estimates loop lengths. Our observations give a full account of the algorithms learned by the models. They suggest that the difficulty of learning such a complex arithmetic function lies in figuring out the control structure of the computation -- the length of the loops. We believe that the approach outlined here, using mathematical problems as tools for understanding, explaining, and perhaps improving language models, can be applied to a broad range of problems and yield fruitful results.
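The compressed Collatz map defined in the abstract is straightforward to sketch in Python. Note that `long_collatz_step` below is only an illustration: it applies the map a fixed number of times `k`, whereas the paper's precise definition of a "long step" (odd integer to distant successor) is not reproduced in this summary.

```python
def collatz_step(u: int) -> int:
    # Compressed Collatz map from the abstract:
    # u/2 if u is even, (3u+1)/2 if u is odd.
    return u // 2 if u % 2 == 0 else (3 * u + 1) // 2


def long_collatz_step(u: int, k: int = 4) -> int:
    # Hypothetical "long step": apply the compressed map k times.
    # The paper's exact definition may differ.
    for _ in range(k):
        u = collatz_step(u)
    return u
```

For example, starting from $7$ the compressed map produces $7 \to 11 \to 17 \to 26 \to 13$, so four applications take the odd integer $7$ to the odd integer $13$.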
Related papers
- Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls [54.57326125204404]
Language models are increasingly capable, yet still fail at the seemingly simple task of multi-digit multiplication. We study why, by reverse-engineering a model that successfully learns multiplication via "implicit chain-of-thought".
arXiv Detail & Related papers (2025-09-30T19:03:26Z) - Model Enumeration of Two-Variable Logic with Quadratic Delay Complexity [6.164137092509227]
We study the model enumeration problem of the function-free, finite-domain fragment of first-order logic with two variables ($FO^2$). How can one enumerate all the models of $\Gamma$ over a domain of size $n$?
arXiv Detail & Related papers (2025-05-26T08:04:19Z) - Reasoning with Latent Thoughts: On the Power of Looped Transformers [52.84192961524481]
We show that for many synthetic reasoning problems, a $k$-layer transformer looped $L$ times nearly matches the performance of a $kL$-layer non-looped model. Our empirical analysis reveals an intriguing phenomenon: looped and non-looped models exhibit scaling behavior that depends on their effective depth.
arXiv Detail & Related papers (2025-02-24T18:49:05Z) - (How) Can Transformers Predict Pseudo-Random Numbers? [7.201095605457193]
We study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs). We find that Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a, c$). We also show that Transformers can generalize to unseen moduli up to $m_{\text{test}} = 2^{16}$.
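A linear congruential generator, the sequence family studied in the paper above, follows the standard recurrence $x_{k+1} = (a x_k + c) \bmod m$. A minimal sketch (the specific parameter values here are arbitrary, chosen only for illustration):

```python
def lcg(seed: int, a: int, c: int, m: int, n: int) -> list[int]:
    # Linear congruential generator: x_{k+1} = (a*x_k + c) mod m.
    # Returns the first n terms after the seed.
    xs = []
    x = seed
    for _ in range(n):
        x = (a * x + c) % m
        xs.append(x)
    return xs
```

For instance, `lcg(1, 5, 3, 16, 4)` yields `[8, 11, 10, 5]`; an in-context learner would be shown a prefix of such a sequence and asked to predict the next term without being told $a$, $c$, or $m$.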
arXiv Detail & Related papers (2025-02-14T18:59:40Z) - Making Hard Problems Easier with Custom Data Distributions and Loss Regularization: A Case Study in Modular Arithmetic [30.93087957720688]
We develop techniques that significantly boost the performance of ML models on modular arithmetic tasks. Our core innovation is the use of custom training data distributions and a carefully designed loss function. Our techniques also help ML models learn other well-studied problems better, including copy, associative recall, and parity.
arXiv Detail & Related papers (2024-10-04T16:19:33Z) - Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia [55.23627698804683]
We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^2$ or $10^3$ system across training data scales.
We identify that base $100$ and base $1000$ systems struggle on token-level discernment and token-level operations.
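Base choice matters because it determines how a number is split into tokens, as in both the Collatz paper and the numeral-systems study above. A small sketch of positional base conversion makes the difference concrete:

```python
def to_base(n: int, b: int) -> list[int]:
    # Represent a non-negative integer n as a list of base-b digits,
    # most significant first.
    digits = []
    while n > 0:
        digits.append(n % b)
        n //= b
    return digits[::-1] or [0]
```

The same integer $12345$ becomes five tokens in base $10$ (`[1, 2, 3, 4, 5]`) but only two in base $1000$ (`[12, 345]`): larger bases shorten sequences at the cost of a much larger digit vocabulary.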
arXiv Detail & Related papers (2024-09-25T22:08:31Z) - Agnostically Learning Multi-index Models with Queries [54.290489524576756]
We study the power of query access for the task of agnostic learning under the Gaussian distribution.
We show that query access gives significant runtime improvements over random examples for agnostically learning MIMs.
arXiv Detail & Related papers (2023-12-27T15:50:47Z) - Simple online learning with consistent oracle [55.43220407902113]
We consider online learning in the model where a learning algorithm can access the class only via the consistent oracle -- an oracle that, at any moment, can return a function from the class that agrees with all examples seen so far.
arXiv Detail & Related papers (2023-08-15T21:50:40Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Length Generalization in Arithmetic Transformers [41.62455986786115]
We show how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training.
We propose train set priming: adding a few ($10$ to $50$) long sequences to the training set.
We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35 \times 3$ examples.
arXiv Detail & Related papers (2023-06-27T11:53:25Z) - On the Provable Generalization of Recurrent Neural Networks [7.115768009778412]
We analyze the training and generalization of Recurrent Neural Networks (RNNs).
We prove a generalization error bound for learning functions without normalization conditions.
We also prove a novel result for learning $N$-variable functions of the input sequence.
arXiv Detail & Related papers (2021-09-29T02:06:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.