Establishing Vocabulary Tests as a Benchmark for Evaluating Large
Language Models
- URL: http://arxiv.org/abs/2310.14703v2
- Date: Mon, 29 Jan 2024 09:26:36 GMT
- Title: Establishing Vocabulary Tests as a Benchmark for Evaluating Large
Language Models
- Authors: Gonzalo Martínez, Javier Conde, Elena Merino-Gómez, Beatriz
Bermúdez-Margaretto, José Alberto Hernández, Pedro Reviriego, Marc
Brysbaert
- Abstract summary: We advocate for the revival of vocabulary tests as a valuable tool for assessing the performance of Large Language Models (LLMs).
We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge.
- Score: 2.7013338932521416
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vocabulary tests, once a cornerstone of language modeling evaluation, have
been largely overlooked in the current landscape of Large Language Models
(LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus
on specific tasks or domain-specific knowledge, they often neglect the
fundamental linguistic aspects of language understanding and production. In
this paper, we advocate for the revival of vocabulary tests as a valuable tool
for assessing LLM performance. We evaluate seven LLMs using two vocabulary test
formats across two languages and uncover surprising gaps in their lexical
knowledge. These findings shed light on the intricacies of LLM word
representations, their learning mechanisms, and performance variations across
models and languages. Moreover, the ability to automatically generate and
perform vocabulary tests offers new opportunities to expand the approach and
provide a more complete picture of LLMs' language skills.
Related papers
- Can LLMs Help Create Grammar?: Automating Grammar Creation for Endangered Languages with In-Context Learning [0.0]
This paper explores how Large Language Models (LLMs) can assist in generating grammatical information for low-resource languages with a limited amount of data.
Our methodology involves organising the existing linguistic data and prompting the model so that it can efficiently generate a formal XLE grammar.
This study highlights the potential of LLMs to enhance language documentation efforts, providing a cost-effective solution for generating linguistic data and contributing to the preservation of endangered languages.
arXiv Detail & Related papers (2024-12-14T20:43:12Z)
- How Do Multilingual Language Models Remember Facts? [50.13632788453612]
We show that previously identified recall mechanisms in English largely apply to multilingual contexts.
We localize the role of language during recall, finding that subject enrichment is language-independent.
In decoder-only LLMs, function vectors (FVs) compose these two pieces of information in two separate stages.
arXiv Detail & Related papers (2024-10-18T11:39:34Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Decomposed Prompting: Unveiling Multilingual Linguistic Structure
Knowledge in English-Centric Large Language Models [12.700783525558721]
English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks.
This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks.
arXiv Detail & Related papers (2024-02-28T15:15:39Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large
Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages.
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Spoken Language Intelligence of Large Language Models for Language Learning [3.1964044595140217]
We focus on evaluating the efficacy of large language models (LLMs) in the realm of education.
We introduce a new multiple-choice question dataset to evaluate the effectiveness of LLMs in the aforementioned scenarios.
We also investigate the influence of various prompting techniques, such as zero- and few-shot methods.
We find that models of different sizes have a good understanding of concepts in phonetics, phonology, and second language acquisition, but show limitations in reasoning about real-world problems.
arXiv Detail & Related papers (2023-08-28T12:47:41Z)
- Adapters for Enhanced Modeling of Multilingual Knowledge and Text [54.02078328453149]
Language models have been extended to multilingual language models (MLLMs).
Knowledge graphs contain facts in an explicit triple format, which require careful curation and are only available in a few high-resource languages.
We propose to enhance MLLMs with knowledge from multilingual knowledge graphs (MLKGs) so as to tackle language and knowledge graph tasks across many languages.
arXiv Detail & Related papers (2022-10-24T21:33:42Z)
- A Primer on Pretrained Multilingual Language Models [18.943173499882885]
Multilingual Language Models (MLLMs) have emerged as a viable option for bringing the power of pretraining to a large number of languages.
We review the existing literature covering the above broad areas of research pertaining to MLLMs.
arXiv Detail & Related papers (2021-07-01T18:01:46Z)