Learning Numeral Embeddings
- URL: http://arxiv.org/abs/2001.00003v3
- Date: Sat, 11 Jan 2020 14:00:55 GMT
- Title: Learning Numeral Embeddings
- Authors: Chengyue Jiang, Zhonglin Nian, Kaihao Guo, Shanbo Chu, Yinggong Zhao,
Libin Shen, Kewei Tu
- Abstract summary: Existing word embedding methods do not learn numeral embeddings well because there are an infinite number of numerals.
We propose two novel numeral embedding methods that can handle the out-of-vocabulary (OOV) problem for numerals.
- Score: 20.951228068643946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word embedding is an essential building block for deep learning methods for
natural language processing. Although word embedding has been extensively
studied over the years, the problem of how to effectively embed numerals, a
special subset of words, is still underexplored. Existing word embedding
methods do not learn numeral embeddings well because there are an infinite
number of numerals and their individual appearances in training corpora are
highly scarce. In this paper, we propose two novel numeral embedding methods
that can handle the out-of-vocabulary (OOV) problem for numerals. We first
induce a finite set of prototype numerals using either a self-organizing map or
a Gaussian mixture model. We then represent the embedding of a numeral as a
weighted average of the prototype number embeddings. Numeral embeddings
represented in this manner can be plugged into existing word embedding learning
approaches such as skip-gram for training. We evaluated our methods and showed
their effectiveness on four intrinsic and extrinsic tasks: word similarity,
embedding numeracy, numeral prediction, and sequence labeling.
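
The following is a minimal, non-authoritative sketch of the prototype-based numeral embedding idea described in the abstract. It assumes a Gaussian mixture model from scikit-learn for inducing prototypes, a sign-preserving log squashing of numeral values, and the GMM posterior as the mixing weights; the paper's exact transformations, weight functions, self-organizing-map variant, and joint skip-gram training are not reproduced here.

```python
# Sketch of prototype-based numeral embeddings (not the authors' code):
# 1) induce a finite set of prototype numerals with a Gaussian mixture model,
# 2) embed any numeral, including out-of-vocabulary ones, as a weighted
#    average of the prototype embeddings.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Numerals observed in a training corpus (toy example).
corpus_numerals = np.array([1, 2, 3, 10, 12, 100, 250, 1000, 1500, 2020], dtype=float)

def squash(x):
    # Sign-preserving log transform so prototypes can span several orders of
    # magnitude (an illustrative assumption, not necessarily the paper's choice).
    return np.sign(x) * np.log1p(np.abs(x))

K, DIM = 4, 8  # number of prototypes, embedding dimension
gmm = GaussianMixture(n_components=K, random_state=0).fit(
    squash(corpus_numerals).reshape(-1, 1)
)

# In the actual method these vectors are trained jointly with skip-gram;
# random placeholders are used here.
prototype_vectors = rng.normal(size=(K, DIM))

def numeral_embedding(value):
    """Embed a numeral as the posterior-weighted average of prototype vectors."""
    weights = gmm.predict_proba([[float(squash(value))]])[0]  # shape (K,)
    return weights @ prototype_vectors

# Works for numerals that never appeared in training, e.g. 37.5 or 123456.
print(numeral_embedding(37.5))
print(numeral_embedding(123456))
```

During training, this weighted average would stand in for the numeral's word vector inside skip-gram, so the prototype embeddings receive gradient from every numeral occurrence.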
Related papers
- Number Cookbook: Number Understanding of Language Models and How to Improve It [63.9542740221096]
Large language models (LLMs) can solve an increasing number of complex reasoning tasks while making surprising mistakes in basic numerical understanding and processing.
This paper comprehensively investigates the numerical understanding and processing ability (NUPA) of LLMs.
arXiv Detail & Related papers (2024-11-06T08:59:44Z)
- Laying Anchors: Semantically Priming Numerals in Language Modeling [11.831883526217942]
We introduce strategies to semantically prime numerals in any corpus by generating anchors governed by the distribution of numerals in said corpus.
We demonstrate significant improvements in the mathematical grounding of our learned embeddings.
arXiv Detail & Related papers (2024-04-02T00:02:00Z)
- Greed is All You Need: An Evaluation of Tokenizer Inference Methods [4.300681074103876]
We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes.
We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.
arXiv Detail & Related papers (2024-03-02T19:01:40Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually grounded text perturbation methods such as typos and word-order shuffling, which resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Multi hash embeddings in spaCy [1.6790532021482656]
spaCy is a natural language processing library that generates multi-embedding representations of words.
The default embedding layer in spaCy is a hash embeddings layer.
In this technical report we lay out a bit of history and introduce the embedding methods in spaCy in detail.
arXiv Detail & Related papers (2022-12-19T06:03:04Z)
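
The spaCy entry above mentions a hash embeddings layer; below is a minimal sketch of that general idea (often called Bloom embeddings), in which each token is hashed with several seeds into a small table and the matching rows are summed. The table size, number of seeds, and MD5-based hashing are illustrative assumptions and do not reflect spaCy's actual implementation.

```python
# Sketch of hash ("Bloom") embeddings: any string gets a vector without a
# fixed vocabulary, because lookups go through hashing rather than a word list.
import hashlib
import numpy as np

TABLE_ROWS, DIM, NUM_SEEDS = 5000, 64, 4
rng = np.random.default_rng(0)
table = rng.normal(scale=0.1, size=(TABLE_ROWS, DIM))  # trained in practice

def hashed_row(token: str, seed: int) -> int:
    """Deterministically map (seed, token) to a row index."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest()
    return int(digest, 16) % TABLE_ROWS

def bloom_embed(token: str) -> np.ndarray:
    """Sum one table row per seed; hash collisions are tolerated by design."""
    return sum(table[hashed_row(token, seed)] for seed in range(NUM_SEEDS))

print(bloom_embed("embedding").shape)       # (64,)
print(bloom_embed("never-seen-token")[:4])  # no out-of-vocabulary failure
```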
- Word-Level Representation From Bytes For Language Modeling [46.28198397863388]
Sub-word tokenization is not robust to noise and is difficult to generalize to new languages.
We introduce a cross-attention network that builds word-level representations directly from bytes, and sub-word-level prediction based on word-level hidden states.
Byte2Word is on par with the strong sub-word baseline BERT while using only 10% of the embedding size.
arXiv Detail & Related papers (2022-11-23T03:11:13Z)
- Arithmetic-Based Pretraining -- Improving Numeracy of Pretrained Language Models [67.48894919842576]
State-of-the-art pretrained language models tend to perform below their capabilities when applied out-of-the-box on tasks that require numeracy.
We propose a new extended pretraining approach called Arithmetic-Based Pretraining that jointly addresses both contributing factors (tokenization of numbers and the lack of numeracy-targeted pretraining objectives) in one extended pretraining step.
Our experiments show the effectiveness of Arithmetic-Based Pretraining in three different tasks that require improved numeracy.
arXiv Detail & Related papers (2022-05-13T16:10:13Z)
- Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
We conclude that there is no silver-bullet solution for all applications, and there likely never will be.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
- Sent2Matrix: Folding Character Sequences in Serpentine Manifolds for Two-Dimensional Sentence [54.6266741821988]
We propose to convert texts into 2-D representations and develop the Sent2Matrix method.
Our method allows for the explicit incorporation of both word morphologies and boundaries.
Notably, our method is the first attempt to represent texts in 2-D formats.
arXiv Detail & Related papers (2021-03-15T13:52:47Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Can a Fruit Fly Learn Word Embeddings? [16.280120177501733]
The fruit fly brain is one of the best studied systems in neuroscience.
We show that a network motif can learn semantic representations of words and can generate both static and context-dependent word embeddings.
The fruit fly network motif not only achieves performance comparable to existing NLP methods, but also uses only a fraction of the computational resources.
arXiv Detail & Related papers (2021-01-18T05:41:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.