Are Some Words Worth More than Others?
- URL: http://arxiv.org/abs/2010.06069v2
- Date: Wed, 14 Oct 2020 03:39:59 GMT
- Title: Are Some Words Worth More than Others?
- Authors: Shiran Dudy and Steven Bedrick
- Abstract summary: We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
- Score: 3.5598388686985354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current evaluation metrics for language modeling and generation rely heavily
on the accuracy of predicted (or generated) words as compared to a reference
ground truth. While important, token-level accuracy only captures one aspect of
a language model's behavior, and ignores linguistic properties of words that
may allow some mis-predicted tokens to be useful in practice. Furthermore,
statistics directly tied to prediction accuracy (including perplexity) may be
confounded by the Zipfian nature of written language, as the majority of the
prediction attempts will occur with frequently-occurring types. A model's
performance may vary greatly between high- and low-frequency words, which in
practice could lead to failure modes such as repetitive and dull generated text
being produced by a downstream consumer of a language model. To address this,
we propose two new intrinsic evaluation measures within the framework of a
simple word prediction task that are designed to give a more holistic picture
of a language model's performance. We evaluate several commonly-used large
English language models using our proposed metrics, and demonstrate that our
approach reveals functional differences in performance between the models that
are obscured by more traditional metrics.
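
As a rough illustration of the frequency confound described in the abstract, the sketch below reports top-1 word-prediction accuracy separately for high- and low-frequency target words. The bucketing scheme and cutoff are arbitrary choices for illustration; these are not the metrics proposed in the paper.

```python
from collections import Counter

def stratified_accuracy(predictions, train_tokens, cutoff=1000):
    """Top-1 word-prediction accuracy, reported separately for high- and
    low-frequency target words. `predictions` is a list of (predicted, gold)
    token pairs; `train_tokens` is the corpus used to rank word frequency.
    The frequency cutoff is an arbitrary choice for illustration."""
    freq = Counter(train_tokens)
    frequent = {w for w, _ in freq.most_common(cutoff)}
    buckets = {"high": [0, 0], "low": [0, 0]}  # [correct, total] per band
    for pred, gold in predictions:
        band = "high" if gold in frequent else "low"
        buckets[band][0] += int(pred == gold)
        buckets[band][1] += 1
    return {band: correct / total if total else float("nan")
            for band, (correct, total) in buckets.items()}
```

A large gap between the two bands would be invisible to corpus-level accuracy or perplexity, since frequently-occurring types dominate the token count.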
Related papers
- Robustifying Language Models with Test-Time Adaptation [17.96043752001886]
Large-scale language models have achieved state-of-the-art performance on a number of language tasks.
However, they fail on adversarial examples: sentences optimized to fool the model while remaining semantically similar for human readers.
We show that we can reverse many language adversarial attacks by adapting the input sentence using predictions from masked words.
arXiv Detail & Related papers (2023-10-29T22:37:54Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantic parsers.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
arXiv Detail & Related papers (2021-04-18T08:13:06Z)
- Unigram-Normalized Perplexity as a Language Model Performance Measure with Different Vocabulary Sizes [4.477547027158141]
We propose a new metric that can be used to evaluate language model performance with different vocabulary sizes.
The proposed unigram-normalized perplexity expresses a language model's improvement over a simple unigram model (see the sketch after this list).
arXiv Detail & Related papers (2020-11-26T10:39:03Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction starting from nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
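
As flagged in the unigram-normalized perplexity entry above, here is a minimal sketch of the normalization idea, assuming the metric is simply the model's perplexity divided by the perplexity of a unigram baseline over the same tokens; the referenced paper's exact definition may differ.

```python
import math

def unigram_normalized_perplexity(model_logprobs, unigram_logprobs):
    """Ratio of the model's perplexity to that of a unigram baseline,
    computed over the same token sequence (natural-log probabilities)."""
    n = len(model_logprobs)
    ppl_model = math.exp(-sum(model_logprobs) / n)
    ppl_unigram = math.exp(-sum(unigram_logprobs) / n)
    return ppl_model / ppl_unigram
```

Under this reading, a value below 1.0 indicates improvement over the unigram baseline, which makes comparisons across models with different vocabulary sizes more meaningful than raw perplexity.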
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.