Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models
- URL: http://arxiv.org/abs/2411.02083v1
- Date: Mon, 04 Nov 2024 13:43:24 GMT
- Title: Regress, Don't Guess -- A Regression-like Loss on Number Tokens for Language Models
- Authors: Jonas Zausinger, Lars Pennig, Kacper Chlodny, Vincent Limbach, Anna Ketteler, Thorben Prein, Vishwa Mohan Singh, Michael Morris Danziger, Jannis Born
- Abstract summary: We present two versions of a number token loss for language models.
The first is based on an $L_p$ loss between the ground truth token value and the prediction's expected value, i.e. the number token values weighted by their predicted class probabilities.
The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution.
- Score: 2.5346260093097017
- License:
- Abstract: While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving reasoning over quantities, especially arithmetic. This has particular relevance in scientific datasets where combinations of text and numerical data are abundant. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal (categorical) scale and thus cannot convey proximity between generated number tokens. As a remedy, we present two versions of a number token loss. The first is based on an $L_p$ loss between the ground truth token value and the weighted sum of the predicted class probabilities. The second loss minimizes the Wasserstein-1 distance between the distribution of the predicted output probabilities and the ground truth distribution. These regression-like losses can easily be added to any language model and extend the CE objective during training. We compare the proposed schemes on a mathematics dataset against existing tokenization, encoding, and decoding schemes for improving number representation in language models. Our results reveal a significant improvement in numerical accuracy when equipping a standard T5 model with the proposed loss schemes.
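To make the two objectives concrete: the first loss is roughly $\big| y - \sum_k p_k v_k \big|^p$, where $y$ is the ground truth value, $v_k$ the numeric value of number token $k$, and $p_k$ its predicted probability; the second measures the Wasserstein-1 distance between the predicted and ground truth distributions over the number tokens, which in one dimension reduces to the area between the two CDFs. The sketch below is a minimal PyTorch-style illustration of this idea, not the authors' reference implementation; the function name, argument names, and the assumption that the number tokens form a small sub-vocabulary with sorted values (e.g. the digits 0-9) are assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F


def number_token_losses(logits, target_ids, number_token_ids, number_token_values, p=2):
    """Sketch of the two regression-like number token losses described above.

    logits:              (batch, seq, vocab) raw model outputs
    target_ids:          (batch, seq) ground truth token ids
    number_token_ids:    (K,) tensor of vocabulary ids of the number tokens, e.g. "0".."9"
    number_token_values: (K,) tensor of numeric values of those tokens, sorted ascending
    """
    # Select positions whose ground truth token is a number token.
    is_number = torch.isin(target_ids, number_token_ids)
    if not is_number.any():
        zero = logits.new_zeros(())
        return zero, zero

    # Predicted distribution restricted to the number-token sub-vocabulary.
    probs = F.softmax(logits[..., number_token_ids], dim=-1)[is_number]              # (N, K)

    # One-hot ground truth distribution over the same sub-vocabulary and its value.
    one_hot = (target_ids[is_number].unsqueeze(-1) == number_token_ids).to(probs.dtype)  # (N, K)
    true_values = (one_hot * number_token_values).sum(-1)                             # (N,)

    # Loss 1: L_p distance between the true value and the expected value of the
    # predicted distribution, |y - sum_k p_k * v_k|^p.
    expected_values = (probs * number_token_values).sum(-1)
    lp_loss = (expected_values - true_values).abs().pow(p).mean()

    # Loss 2: Wasserstein-1 distance between the predicted and ground truth
    # distributions: the absolute CDF difference weighted by the value spacing.
    gaps = number_token_values[1:] - number_token_values[:-1]                         # (K-1,)
    cdf_diff = torch.cumsum(probs - one_hot, dim=-1)[..., :-1]                        # (N, K-1)
    w1_loss = (cdf_diff.abs() * gaps).sum(-1).mean()

    return lp_loss, w1_loss
```

In training, either term would typically be added to the standard cross-entropy objective with a weighting factor, e.g. `loss = ce_loss + lam * ntl_loss`; the abstract does not fix a particular weighting, so `lam` is a placeholder hyperparameter here.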
Related papers
- Understanding and Mitigating Tokenization Bias in Language Models [6.418593476658017]
State-of-the-art language models are autoregressive and operate on subword units known as tokens.
We show that popular encoding schemes induce a sampling bias that cannot be mitigated with more training or data.
We propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data.
arXiv Detail & Related papers (2024-06-24T17:38:02Z)
- Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training [57.771940716189114]
We show that large language models (LLMs) suffer from the "reversal curse": a model trained on "A is B" often fails to infer "B is A".
The root cause of the reversal curse lies in the different word order between the training and inference stages.
We propose Semantic-aware Permutation Training (SPT) to address this issue.
arXiv Detail & Related papers (2024-03-01T18:55:20Z)
- Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs [3.6722413665749674]
Tokenization is the division of input text into tokens.
We study the effect this choice has on numerical reasoning using arithmetic tasks.
arXiv Detail & Related papers (2024-02-22T18:14:09Z)
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose the MiLe Loss function to mitigate the bias arising from differing learning difficulties across tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
- Estimating Numbers without Regression [30.79061214333164]
Despite recent successes in language models, their ability to represent numbers is insufficient.
Subword tokenization splits numbers into arbitrary chunks and thus fails to capture magnitude explicitly.
We show that changing the model's vocabulary instead (e.g., introducing a new token for numbers in the range 10-100) is a far better trade-off.
arXiv Detail & Related papers (2023-10-09T23:07:05Z)
- Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z)
- An Empirical Investigation of Contextualized Number Prediction [34.56914472173953]
We consider two tasks: (1) masked number prediction, predicting a missing numerical value within a sentence, and (2) numerical anomaly detection, detecting an erroneous numeric value within a sentence.
We introduce a suite of output distribution parameterizations that incorporate latent variables to add expressivity and better fit the natural distribution of numeric values in running text.
We evaluate these models on two numeric datasets in the financial and scientific domains.
arXiv Detail & Related papers (2020-10-20T23:12:23Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.