Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia
- URL: http://arxiv.org/abs/2409.17391v2
- Date: Fri, 27 Sep 2024 02:18:22 GMT
- Title: Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia
- Authors: Zhejian Zhou, Jiayu Wang, Dahua Lin, Kai Chen
- Abstract summary: We study the scaling behavior of different numeral systems in the context of transformer-based large language models.
A base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scales.
We identify that base $100$ and base $1000$ systems struggle with token-level discernment and token-level operations.
- Score: 55.23627698804683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though Large Language Models (LLMs) have shown remarkable abilities in mathematical reasoning, they still struggle to perform numeric operations accurately, such as addition and multiplication. Different LLMs tokenize numbers in different ways, which affects numeric operation performance. Currently, there are two representative approaches: 1) tokenizing numbers into $1$-digit tokens, and 2) tokenizing numbers into $1\sim 3$-digit tokens. The difference is roughly equivalent to using different numeral systems (namely base $10$ or base $10^{3}$). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base $10$ system is consistently more data-efficient than a base $10^{2}$ or $10^{3}$ system across training data scales and model sizes under from-scratch training settings, while different numeral systems have very similar fine-tuning performance. We attribute this to the higher token frequencies of a base $10$ system. Additionally, we reveal extrapolation behavior patterns on addition and multiplication. We identify that base $100$ and base $1000$ systems struggle with token-level discernment and token-level operations. We also shed light on the mechanisms learnt by the models.
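The two tokenization schemes above differ only in how many decimal digits go into each token. As a minimal illustration (our own sketch, not the paper's or Pythia's actual tokenizer), the following Python snippet chunks the same number into base-$10$ versus base-$10^{3}$ tokens:
```python
# Illustrative only: chunk a number's decimal string into fixed-width tokens,
# grouping from the least-significant digit, as in positional notation.

def tokenize_number(n: int, digits_per_token: int = 1) -> list[str]:
    s = str(n)
    pad = (-len(s)) % digits_per_token           # left-pad so chunks align
    s = "0" * pad + s
    chunks = [s[i:i + digits_per_token] for i in range(0, len(s), digits_per_token)]
    chunks[0] = chunks[0].lstrip("0") or "0"      # drop padding zeros in the leading chunk
    return chunks

print(tokenize_number(987654, 1))  # base 10:    ['9', '8', '7', '6', '5', '4']
print(tokenize_number(987654, 3))  # base 10^3:  ['987', '654']
print(tokenize_number(1234, 3))    # base 10^3:  ['1', '234']
```
Under the base-$10$ scheme each digit token recurs very frequently in training data, which is the token-frequency effect the abstract attributes the data-efficiency gap to.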
Related papers
- Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training [4.062463195973711]
We investigate the role of 3 primary variables in a limited data regime as part of the BabyLM challenge.
We find that curriculum learning benefits multimodal evaluations over non-curriculum learning models.
arXiv Detail & Related papers (2024-10-20T21:03:51Z)
- Teaching Transformers Modular Arithmetic at Scale [9.68892691572611]
This work proposes three changes to the modular addition model training pipeline.
We demonstrate success with our approach for $N = 256, q = 3329$, a case which is interesting for cryptographic applications.
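For context, the modular-addition task with $N = 256$ and $q = 3329$ amounts to summing $256$ integers modulo $3329$. A hypothetical data-generation sketch (not the authors' training pipeline) might look like:
```python
# Hypothetical sketch: generate one modular-addition example, i.e. N terms and
# their sum modulo q. The values N=256, q=3329 come from the summary above.
import random

def make_example(n_terms: int = 256, q: int = 3329, seed: int | None = None):
    rng = random.Random(seed)
    xs = [rng.randrange(q) for _ in range(n_terms)]   # operands in [0, q)
    return xs, sum(xs) % q                            # target is the sum mod q

xs, y = make_example(seed=0)
print(len(xs), y)  # 256 operands and their sum modulo 3329
```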
arXiv Detail & Related papers (2024-10-04T16:19:33Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For instance, scaling laws mostly predict next-token prediction loss, but models are usually compared on downstream task performance.
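As a rough illustration of what a scaling-law fit looks like (a toy sketch with made-up numbers, not the paper's fitting procedure, which also handles over-training and downstream metrics), one can fit a power law $L(D) \approx a\,D^{-b}$ to (tokens, loss) pairs:
```python
# Toy sketch: fit L(D) ≈ a * D^(-b) in log-log space. The token/loss values
# below are made up for illustration; they are not results from the paper.
import numpy as np

tokens = np.array([1e9, 3e9, 1e10, 3e10, 1e11])   # hypothetical training-token counts
loss   = np.array([3.9, 3.5, 3.1, 2.8, 2.55])     # hypothetical validation losses

slope, intercept = np.polyfit(np.log(tokens), np.log(loss), 1)
a, b = np.exp(intercept), -slope
print(f"L(D) ≈ {a:.2f} * D^(-{b:.3f})")
print("extrapolated loss at 3e11 tokens:", a * (3e11 ** -b))
```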
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
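As a concrete, purely illustrative example of changing the task representation rather than the model, one could make each digit's place value explicit in the surface form; this is a generic trick, not necessarily the exact scheme proposed in the paper:
```python
# Illustrative only: re-render an addition problem so every digit carries an
# explicit place-value tag, e.g. 1234 -> "1e3 2e2 3e1 4e0". This merely shows
# what "modifying the representation of the arithmetic task" can look like.

def with_position_tags(n: int) -> str:
    digits = str(n)
    return " ".join(f"{d}e{len(digits) - 1 - i}" for i, d in enumerate(digits))

print("plain  :", "1234 + 567 =")
print("tagged :", with_position_tags(1234), "+", with_position_tags(567), "=")
# tagged : 1e3 2e2 3e1 4e0 + 5e2 6e1 7e0 =
```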
arXiv Detail & Related papers (2023-11-22T00:31:01Z)
- Length Generalization in Arithmetic Transformers [41.62455986786115]
We show how transformers cope with two challenges: learning basic integer arithmetic, and generalizing to longer sequences than seen during training.
We propose train set priming: adding a few ($10$ to $50$) long sequences to the training set.
We show that priming allows models trained on $5$-digit $\times$ $3$-digit multiplications to generalize to $35 \times 3$ examples.
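A minimal sketch of what train set priming could look like as a dataset-construction step (the example formatting and counts beyond the summary are assumptions):
```python
# Sketch: a large pool of short multiplication examples plus a handful of long
# "priming" examples mixed in, per the idea summarized above. Formatting details
# are illustrative, not the paper's exact recipe.
import random

rng = random.Random(0)

def mult_example(a_digits: int, b_digits: int) -> str:
    a = rng.randrange(10 ** (a_digits - 1), 10 ** a_digits)
    b = rng.randrange(10 ** (b_digits - 1), 10 ** b_digits)
    return f"{a} * {b} = {a * b}"

train_set  = [mult_example(5, 3) for _ in range(100_000)]  # 5-digit x 3-digit base task
train_set += [mult_example(35, 3) for _ in range(50)]      # a few long priming examples
rng.shuffle(train_set)
print(train_set[0])
```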
arXiv Detail & Related papers (2023-06-27T11:53:25Z)
- Compressing 1D Time-Channel Separable Convolutions using Sparse Random Ternary Matrices [65.4388266814055]
We replace 1x1-convolutions in 1D time-channel separable convolutions with constant, sparse random ternary matrices with weights in $\{-1, 0, +1\}$.
For command recognition on Google Speech Commands v1, we improve the state-of-the-art accuracy from $97.21\%$ to $97.41\%$ at the same network size.
For speech recognition on Librispeech, we halve the number of weights to be trained while only sacrificing about $1\%$ of the floating-point baseline's word error rate.
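A short sketch of what such a constant, sparse random ternary weight matrix could look like (the sparsity level and the wiring into the 1x1-convolution are assumptions for illustration):
```python
# Sketch: build a fixed, sparse, random matrix with entries in {-1, 0, +1}.
# Density 0.1 is an arbitrary choice for illustration.
import numpy as np

def sparse_ternary(rows: int, cols: int, density: float = 0.1, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    w = rng.choice([-1, 0, +1], size=(rows, cols),
                   p=[density / 2, 1.0 - density, density / 2])
    return w.astype(np.float32)   # kept constant (not trained) during training

w = sparse_ternary(64, 128)
print(w.shape, "nonzero fraction:", float((w != 0).mean()))  # roughly 0.1
```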
arXiv Detail & Related papers (2021-03-31T15:09:20Z)
- Investigating the Limitations of the Transformers with Simple Arithmetic Tasks [10.23804850480924]
We find that how a number is represented in its surface form has a strong influence on the model's accuracy.
We conclude that modern pretrained language models can easily learn arithmetic from very few examples.
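For a sense of what "surface form" means here, the same number can be written in several ways; a small illustration follows (the specific formats compared in the paper may differ):
```python
# Illustrative only: a few surface forms of the same number. Which form a model
# sees can strongly affect arithmetic accuracy, per the summary above.
n = 8302
forms = {
    "plain":         str(n),                       # "8302"
    "digit-spaced":  " ".join(str(n)),             # "8 3 0 2"
    "comma-grouped": f"{n:,}",                     # "8,302"
    "words":         "eight thousand three hundred two",
}
for name, form in forms.items():
    print(f"{name:>13}: {form}")
```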
arXiv Detail & Related papers (2021-02-25T17:22:53Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)
- On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t C(\mathcal{F})$.
arXiv Detail & Related papers (2020-06-20T20:33:59Z)