xVal: A Continuous Number Encoding for Large Language Models
- URL: http://arxiv.org/abs/2310.02989v1
- Date: Wed, 4 Oct 2023 17:26:16 GMT
- Title: xVal: A Continuous Number Encoding for Large Language Models
- Authors: Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti,
Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben
Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu,
Kyunghyun Cho, Shirley Ho
- Abstract summary: We propose xVal, a numerical encoding scheme that represents any real number using just a single token.
We empirically evaluate our proposal on a number of synthetic and real-world datasets.
- Score: 42.19323262199993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models have not yet been broadly adapted for the analysis of
scientific datasets due in part to the unique difficulties of tokenizing
numbers. We propose xVal, a numerical encoding scheme that represents any real
number using just a single token. xVal represents a given real number by
scaling a dedicated embedding vector by the number value. Combined with a
modified number-inference approach, this strategy renders the model end-to-end
continuous when considered as a map from the numbers of the input string to
those of the output string. This leads to an inductive bias that is generally
more suitable for applications in scientific domains. We empirically evaluate
our proposal on a number of synthetic and real-world datasets. Compared with
existing number encoding schemes, we find that xVal is more token-efficient and
demonstrates improved generalization.
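For illustration, below is a minimal PyTorch-style sketch of the encoding idea described in the abstract: a single dedicated [NUM] embedding is scaled by each number's value, and a scalar "number head" regresses numeric outputs alongside the usual token predictions. The module names, dimensions, and token-id convention are assumptions for exposition, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class XValEmbedding(nn.Module):
    # Sketch only: every number in the input shares one dedicated [NUM] embedding,
    # which is scaled by the number's value so the representation varies
    # continuously with the number.
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids: torch.LongTensor, values: torch.Tensor) -> torch.Tensor:
        # `values` holds the numeric value at [NUM] positions and 1.0 elsewhere,
        # so ordinary tokens are left unscaled.
        return self.token_emb(token_ids) * values.unsqueeze(-1)

class NumberHead(nn.Module):
    # Sketch of a modified number-inference step: a scalar head reads the hidden
    # state wherever a [NUM] token is predicted and regresses the number directly,
    # keeping the map from input numbers to output numbers continuous.
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden).squeeze(-1)  # (batch, seq) of predicted values

# Hypothetical usage: the string "[NUM] + [NUM] =" with values 0.5 and 1.7,
# using made-up token ids (3 = [NUM], 7 = '+', 9 = '=').
ids = torch.tensor([[3, 7, 3, 9]])
vals = torch.tensor([[0.5, 1.0, 1.7, 1.0]])
embeddings = XValEmbedding(vocab_size=16, d_model=32)(ids, vals)  # shape (1, 4, 32)
```

Because the same [NUM] embedding direction is reused and only its magnitude changes with the value, nearby numbers receive nearby representations, which is the continuity property the abstract emphasizes.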
Related papers
- Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution [7.681258910515419]
Tabular data presents unique challenges due to its heterogeneous nature and complex structural relationships.
High predictive performance and robustness in tabular data analysis hold significant promise for numerous applications.
The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning.
arXiv Detail & Related papers (2024-08-20T04:59:19Z)
- Computational Models to Study Language Processing in the Human Brain: A Survey [47.81066391664416]
This paper reviews efforts in using computational models for brain research, highlighting emerging trends.
Our analysis reveals that no single model outperforms others on all datasets.
arXiv Detail & Related papers (2024-03-20T08:01:22Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Arithmetic with Language Models: from Memorization to Computation [3.077668143048211]
This work investigates how a language model, trained to predict the next token, can perform arithmetic computations generalizing beyond training data.
We successfully trained a lightweight language model to learn these tasks and ran a number of experiments to investigate its extrapolation capabilities and internal information processing.
arXiv Detail & Related papers (2023-08-02T13:58:37Z)
- MIST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text [1.8502316793903635]
We introduce the MIST dataset, which contains 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.
We systematically evaluate a set of competitive neural architectures on MIST.
Our corpus analysis provides evidence that scientific communities differ in their usage of modal verbs.
arXiv Detail & Related papers (2022-12-14T11:10:03Z)
- Detecting Text Formality: A Study of Text Classification Approaches [78.11745751651708]
This work presents what is, to our knowledge, the first systematic study of formality detection methods based on statistical, neural, and Transformer-based machine learning approaches.
We conducted three types of experiments -- monolingual, multilingual, and cross-lingual.
The study shows that the Char BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks.
arXiv Detail & Related papers (2022-04-19T16:23:07Z)
- A Cognitive Regularizer for Language Modeling [36.256053903862956]
We augment the canonical MLE objective for training language models with a regularizer that encodes UID (uniform information density).
We find that using UID regularization consistently improves perplexity in language models.
We also find that UID-regularized language models are higher-entropy and produce text that is longer and more lexically diverse.
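As a rough illustration of what such a regularizer can look like, the sketch below assumes a common UID formulation that penalizes the variance of per-token surprisals; the exact objective, weighting, and hyperparameters are assumptions for illustration rather than the paper's precise method.

```python
import torch
import torch.nn.functional as F

def uid_regularized_loss(logits: torch.Tensor, targets: torch.Tensor, beta: float = 0.01) -> torch.Tensor:
    # Cross-entropy plus a UID-style penalty on the variance of per-token surprisals.
    # The variance formulation and the beta value are illustrative assumptions.
    surprisal = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), reduction="none"
    ).view(targets.shape)                       # (batch, seq): -log p(target_t | context)
    nll = surprisal.mean()                      # canonical MLE objective
    uid_penalty = surprisal.var(dim=-1).mean()  # uneven information density is penalized
    return nll + beta * uid_penalty
```

In training, such a term would simply replace the plain cross-entropy loss, with beta tuned on validation perplexity.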
arXiv Detail & Related papers (2021-05-15T05:37:42Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, with models improving as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.