Related papers: LMentry: A Language Model Benchmark of Elementary Language Tasks

LMentry: A Language Model Benchmark of Elementary Language Tasks

URL: http://arxiv.org/abs/2211.02069v1
Date: Thu, 3 Nov 2022 18:01:12 GMT
Title: LMentry: A Language Model Benchmark of Elementary Language Tasks
Authors: Avia Efrat, Or Honovich, Omer Levy
Abstract summary: LMentry is a benchmark that focuses on a compact set of tasks that are trivial to humans. It provides insights into the capabilities and robustness of large language models.
Score: 39.71352171304755
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run "unit test", without resorting to large benchmark suites of complex tasks.

Related papers

Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors [45.37878669586302]
Large language models (LLMs) are increasingly deployed in multilingual, real-world applications with user inputs.<n>Most benchmarks assume clean input, leaving the robustness of LLMs to typos largely underexplored.<n>We introduce MulTypo, a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior.
arXiv Detail & Related papers (2025-10-10T16:49:12Z)
Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
Accelerating Multilingual Language Model for Excessively Tokenized Languages [3.5570874721859016]
tokenizers in large language models (LLMs) often fragment a text into character or Unicode-level tokens in non-Roman alphabetic languages. We introduce a simple yet effective framework to accelerate text generation in such languages.
arXiv Detail & Related papers (2024-01-19T12:26:57Z)
Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities. After comparing large language models against state-of-the-start finetuned smaller models, we present a spectrum showing large language models falling behind, are comparable, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
Are Large Language Models Robust Coreference Resolvers? [17.60248310475889]
We show that prompting for coreference can outperform current unsupervised coreference systems. Further investigations reveal that instruction-tuned LMs generalize surprisingly well across domains, languages, and time periods.
arXiv Detail & Related papers (2023-05-23T19:38:28Z)
Interpretable Unified Language Checking [42.816372695828306]
We present an interpretable, unified, language checking (UniLC) method for both human and machine-generated language. We find that LLMs can achieve high performance on a combination of fact-checking, stereotype detection, and hate speech detection tasks.
arXiv Detail & Related papers (2023-04-07T16:47:49Z)
An Application of Pseudo-Log-Likelihoods to Natural Language Scoring [5.382454613390483]
A language model with relatively few parameters and training steps can outperform it on a recent large data set. We produce some absolute state-of-the-art results for common sense reasoning in binary choice tasks. We argue that robustness of the smaller model ought to be understood in terms of compositionality.
arXiv Detail & Related papers (2022-01-23T22:00:54Z)
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
Multitask Prompted Training Enables Zero-Shot Task Generalization [70.12770442071657]
We develop a system for mapping general natural language tasks into a human-readable prompted form. We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.
arXiv Detail & Related papers (2021-10-15T17:08:57Z)
Pre-training Universal Language Representation [46.51685959045527]
This work introduces universal language representation learning, i.e., embeddings of different levels of linguistic units or text with quite diverse lengths in a uniform vector space. We empirically verify that well designed pre-training scheme may effectively yield universal language representation.
arXiv Detail & Related papers (2021-05-30T09:29:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.