xVal: A Continuous Numerical Tokenization for Scientific Language Models
- URL: http://arxiv.org/abs/2310.02989v2
- Date: Sun, 15 Dec 2024 07:07:28 GMT
- Title: xVal: A Continuous Numerical Tokenization for Scientific Language Models
- Authors: Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho,
- Abstract summary: We introduce xVal, a strategy for continuously tokenizing numbers within language models.
We train specially-modified language models from scratch on a variety of scientific datasets formatted as text.
- Score: 41.26924657687872
- License:
- Abstract: Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically-dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially-modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.
Related papers
- Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution [7.681258910515419]
Tabular data presents unique challenges due to its heterogeneous nature and complex structural relationships.
High predictive performance and robustness in tabular data analysis holds significant promise for numerous applications.
The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning.
arXiv Detail & Related papers (2024-08-20T04:59:19Z) - Computational Models to Study Language Processing in the Human Brain: A Survey [47.81066391664416]
This paper reviews efforts in using computational models for brain research, highlighting emerging trends.
Our analysis reveals that no single model outperforms others on all datasets.
arXiv Detail & Related papers (2024-03-20T08:01:22Z) - Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z) - Arithmetic with Language Models: from Memorization to Computation [3.077668143048211]
This work investigates how a language model, trained to predict the next token, can perform arithmetic computations generalizing beyond training data.
We successfully trained a light language model to learn these tasks and ran a number of experiments to investigate the extrapolation capabilities and internal information processing.
arXiv Detail & Related papers (2023-08-02T13:58:37Z) - MIST: a Large-Scale Annotated Resource and Neural Models for Functions
of Modal Verbs in English Scientific Text [1.8502316793903635]
We introduce the MIST dataset, which contains 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.
We systematically evaluate a set of competitive neural architectures on MIST.
Our corpus analysis provides evidence that scientific communities differ in their usage of modal verbs.
arXiv Detail & Related papers (2022-12-14T11:10:03Z) - Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task.
arXiv Detail & Related papers (2022-04-19T16:23:07Z) - A Cognitive Regularizer for Language Modeling [36.256053903862956]
We augment the canonical MLE objective for training language models by encoding UID as regularization.
We find that using UID regularization consistently improves perplexity in language models.
We also find that UID-regularized language models are higher-entropy and produce text that is longer and more lexically diverse.
arXiv Detail & Related papers (2021-05-15T05:37:42Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Data Augmentation for Spoken Language Understanding via Pretrained
Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.