Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability
- URL: http://arxiv.org/abs/2510.26792v1
- Date: Thu, 30 Oct 2025 17:59:09 GMT
- Title: Learning Pseudorandom Numbers with Transformers: Permuted Congruential Generators, Curricula, and Interpretability
- Authors: Tao Tao, Maissam Barkeshli
- Abstract summary: We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants.
- Score: 10.75037955193936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the ability of Transformer models to learn sequences generated by Permuted Congruential Generators (PCGs), a widely used family of pseudo-random number generators (PRNGs). PCGs introduce substantial additional difficulty over linear congruential generators (LCGs) by applying a series of bit-wise shifts, XORs, rotations and truncations to the hidden state. We show that Transformers can nevertheless successfully perform in-context prediction on unseen sequences from diverse PCG variants, in tasks that are beyond published classical attacks. In our experiments we scale moduli up to $2^{22}$ using up to $50$ million model parameters and datasets with up to $5$ billion tokens. Surprisingly, we find even when the output is truncated to a single bit, it can be reliably predicted by the model. When multiple distinct PRNGs are presented together during training, the model can jointly learn them, identifying structures from different permutations. We demonstrate a scaling law with modulus $m$: the number of in-context sequence elements required for near-perfect prediction grows as $\sqrt{m}$. For larger moduli, optimization enters extended stagnation phases; in our experiments, learning moduli $m \geq 2^{20}$ requires incorporating training data from smaller moduli, demonstrating a critical necessity for curriculum learning. Finally, we analyze embedding layers and uncover a novel clustering phenomenon: the model spontaneously groups the integer inputs into bitwise rotationally-invariant clusters, revealing how representations can transfer from smaller to larger moduli.
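The abstract describes PCGs as an LCG state transition followed by an output permutation of bit-wise shifts, XORs, rotations and truncations. As a minimal illustrative sketch (not the paper's exact experimental setup, which uses much smaller moduli), here is a PCG-XSH-RR-style generator with 64-bit state and 32-bit output, using the reference constants from the standard PCG design:

```python
# Illustrative PCG-XSH-RR sketch: an LCG updates the 64-bit hidden state,
# then an xorshift + rotate + truncate permutation produces a 32-bit output.
# Constants follow the reference PCG-XSH-RR 64/32 design, not the paper's setup.

MASK64 = (1 << 64) - 1

def pcg_xsh_rr(state, mult=6364136223846793005, inc=1442695040888963407):
    """One step: LCG state transition, then output permutation."""
    state = (state * mult + inc) & MASK64            # LCG: x <- a*x + c mod 2^64
    xorshifted = (((state >> 18) ^ state) >> 27) & 0xFFFFFFFF  # xorshift, truncate
    rot = state >> 59                                # top 5 bits pick the rotation
    out = ((xorshifted >> rot) | (xorshifted << ((32 - rot) & 31))) & 0xFFFFFFFF
    return state, out

# Generate a short output sequence from a fixed seed.
state, seq = 42, []
for _ in range(5):
    state, out = pcg_xsh_rr(state)
    seq.append(out)
```

The output permutation is what makes PCG prediction much harder than LCG prediction: the model only ever sees the truncated, rotated 32-bit outputs, never the 64-bit hidden state.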
Related papers
- Machine learning modularity [6.316250090403085]
This work introduces a machine learning framework for automatically simplifying complex expressions involving multiple elliptic Gamma functions. The model learns to apply algebraic identities, particularly the SL$(2,\mathbb{Z})$ modular transformations, to reduce heavily scrambled expressions to their canonical forms.
arXiv Detail & Related papers (2026-01-05T04:17:55Z) - Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls [54.57326125204404]
Language models are increasingly capable, yet still fail at a seemingly simple task of multi-digit multiplication. We study why, by reverse-engineering a model that successfully learns multiplication via "implicit chain-of-thought".
arXiv Detail & Related papers (2025-09-30T19:03:26Z) - Transformers in Pseudo-Random Number Generation: A Dual Perspective on Theory and Practice [1.8725832935669624]
Pseudo-random number generators (PRNGs) are highly nonlinear processes and key building blocks in the optimization of large language models. We show that it is feasible to generate high-quality pseudo-random numbers with transformers.
arXiv Detail & Related papers (2025-08-02T01:31:53Z) - The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models [71.5283441529015]
In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace. We introduce the generative leap exponent $k^\star$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting.
arXiv Detail & Related papers (2025-06-05T18:34:56Z) - Learning Compositional Functions with Transformers from Easy-to-Hard Data [63.96562216704653]
We study the learnability of the $k$-fold composition task, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations. We show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer.
arXiv Detail & Related papers (2025-05-29T17:22:00Z) - (How) Can Transformers Predict Pseudo-Random Numbers? [7.201095605457193]
We study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs). We find that Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a,c$). We also show that Transformers can generalize to unseen moduli up to $m_{\text{test}} = 2^{16}$.
arXiv Detail & Related papers (2025-02-14T18:59:40Z) - Can Transformers Do Enumerative Geometry? [44.99833362998488]
We introduce a Transformer-based approach to computational enumerative geometry. We compute intersection numbers across a value range from $10^{-45}$ to $10^{45}$. We show that the network is implicitly modeling the Virasoro constraints in a purely data-driven manner.
arXiv Detail & Related papers (2024-08-27T09:44:01Z) - Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks [5.358878931933351]
We study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks.
Specifically, we consider a finite collection of linear modular functions $z = a\,x + b\,y \;\mathrm{mod}\; p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$.
arXiv Detail & Related papers (2024-06-04T17:59:36Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - Supervised deep learning prediction of the formation enthalpy of the full set of configurations in complex phases: the $\sigma$-phase as an example [1.8369974607582582]
We show how machine learning can be used to predict several properties in solid-state chemistry.
In particular, it can be used to predict the heat of formation of a given complex crystallographic phase.
arXiv Detail & Related papers (2020-11-21T22:07:15Z) - Unsupervised Controllable Generation with Self-Training [90.04287577605723]
Controllable generation with GANs remains a challenging research problem.
We propose an unsupervised framework to learn a distribution of latent codes that control the generator through self-training.
Our framework exhibits better disentanglement compared to other variants such as the variational autoencoder.
arXiv Detail & Related papers (2020-07-17T21:50:35Z) - $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.