Unigram-Normalized Perplexity as a Language Model Performance Measure
with Different Vocabulary Sizes
- URL: http://arxiv.org/abs/2011.13220v1
- Date: Thu, 26 Nov 2020 10:39:03 GMT
- Title: Unigram-Normalized Perplexity as a Language Model Performance Measure
with Different Vocabulary Sizes
- Authors: Jihyeon Roh, Sang-Hoon Oh, Soo-Young Lee
- Abstract summary: We propose a new metric that can be used to evaluate language model performance with different vocabulary sizes.
The proposed unigram-normalized perplexity measures a language model's performance improvement over that of a simple unigram model.
- Score: 4.477547027158141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although perplexity is a widely used performance metric for language
models, its value depends heavily on the number of words in the corpus, so it
is useful only for comparing performance on the same corpus. In this paper, we
propose a new metric that can be used to evaluate language model performance
with different vocabulary sizes. The proposed unigram-normalized perplexity
measures the performance improvement of a language model over that of a simple
unigram model, and is robust to vocabulary size. Both theoretical analysis and
computational experiments are reported.
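The paper's exact formula is not reproduced in this summary. As an illustrative sketch only, assuming the unigram-normalized metric is the ratio of the model's perplexity to that of a unigram baseline estimated from corpus frequencies (the function names `perplexity`, `unigram_log_probs`, and `unigram_normalized_ppl` are hypothetical, not from the paper):

```python
import math
from collections import Counter

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

def unigram_log_probs(tokens):
    """Log-probability each token receives under a unigram model
    estimated from the token frequencies of the corpus itself."""
    counts = Counter(tokens)
    total = len(tokens)
    return [math.log(counts[t] / total) for t in tokens]

def unigram_normalized_ppl(model_log_probs, tokens):
    """Model perplexity divided by unigram perplexity.
    Values below 1 indicate an improvement over the unigram baseline;
    the unigram term makes the ratio less sensitive to vocabulary size."""
    return perplexity(model_log_probs) / perplexity(unigram_log_probs(tokens))

# Toy usage: a model assigning probability 0.5 to every token of a
# 4-token corpus, compared against the corpus's own unigram statistics.
tokens = ["a", "b", "a", "c"]
model_lp = [math.log(0.5)] * 4
print(unigram_normalized_ppl(model_lp, tokens))  # < 1: beats the unigram baseline
```

Because both numerator and denominator grow with vocabulary size in a similar way, the ratio stays comparable across corpora with different vocabularies, which is the behavior the abstract claims for the proposed metric.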
Related papers
- Morphological evaluation of subwords vocabulary used by BETO language model [0.1638581561083717]
Subword tokenization algorithms are more efficient and can independently build the necessary vocabulary of words and subwords without human intervention.
In previous research, we proposed a method to assess the morphological quality of vocabularies, focusing on the overlap between these vocabularies and the morphemes of a given language.
By applying this method to vocabularies created by three subword tokenization algorithms, BPE, Wordpiece, and Unigram, we concluded that these vocabularies generally exhibit very low morphological quality.
This evaluation also helps clarify which algorithm the tokenizer actually uses, namely Wordpiece, given the inconsistencies between the authors' claims.
arXiv Detail & Related papers (2024-10-03T08:07:14Z) - Large Vocabulary Size Improves Large Language Models [28.83786065307658]
We investigate the relationship between subword vocabulary size and the performance of large language models (LLMs)
Experimental results show that larger vocabulary sizes lead to better performance in LLMs.
We introduce a simple method to use a new vocabulary instead of the pre-defined one.
arXiv Detail & Related papers (2024-06-24T10:27:07Z) - CompoundPiece: Evaluating and Improving Decompounding Performance of
Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z) - BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and
Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z) - On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z) - Discrete representations in neural models of spoken language [56.29049879393466]
We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language.
We find that the different evaluation metrics can give inconsistent results.
arXiv Detail & Related papers (2021-05-12T11:02:02Z) - Reranking Machine Translation Hypotheses with Structured and Web-based
Language Models [11.363601836199331]
Two structured language models are applied for N-best rescoring.
We find that the combination of these language models increases the BLEU score by up to 1.6% absolute on blind test sets.
arXiv Detail & Related papers (2021-04-25T22:09:03Z) - XL-WiC: A Multilingual Benchmark for Evaluating Semantic
Contextualization [98.61159823343036]
We present the Word-in-Context dataset (WiC) for assessing the ability to correctly model distinct meanings of a word.
We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages.
Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance.
arXiv Detail & Related papers (2020-10-13T15:32:00Z) - Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.