Lessons learned from the evaluation of Spanish Language Models
- URL: http://arxiv.org/abs/2212.08390v2
- Date: Fri, 22 Sep 2023 07:55:52 GMT
- Title: Lessons learned from the evaluation of Spanish Language Models
- Authors: Rodrigo Agerri and Eneko Agirre
- Abstract summary: We present a head-to-head comparison of language models for Spanish with the following results.
We argue that more research is needed to understand the factors underlying these results.
The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem.
- Score: 27.653133576469276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the impact of language models on the field of Natural Language
Processing, a number of Spanish encoder-only masked language models (aka BERTs)
have been trained and released. These models were developed either within large
projects using very large private corpora or by means of smaller scale academic
efforts leveraging freely available data. In this paper we present a
comprehensive head-to-head comparison of language models for Spanish with the
following results: (i) Previously ignored multilingual models from large
companies fare better than monolingual models, substantially changing the
evaluation landscape of language models in Spanish; (ii) Results across the
monolingual models are not conclusive, with supposedly smaller and inferior
models performing competitively. Based on these empirical results, we argue for
the need for more research to understand the factors underlying them. In this
sense, the effects of corpus size, corpus quality and pre-training techniques need
to be investigated further in order to obtain Spanish monolingual models that are
significantly better than the multilingual ones released by large private
companies, especially in the face of rapid ongoing progress in the field. The
recent activity in the development of language technology for Spanish is to be
welcomed, but our results show that building language models remains an open,
resource-heavy problem which requires marrying resources (monetary and/or
computational) with the best research expertise and practice.
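The abstract does not list the concrete models or evaluation tasks here, so the following is only a minimal sketch of the kind of head-to-head setup it describes: probing one multilingual and one monolingual Spanish masked language model on the same input with the Hugging Face `transformers` library. The model identifiers (`xlm-roberta-base`, `dccuchile/bert-base-spanish-wwm-cased`) and the probe sentence are illustrative assumptions, and a fill-mask probe is only a stand-in for the downstream-task evaluation the paper actually performs.
```python
# Hedged sketch: compare a multilingual and a monolingual Spanish masked LM
# on the same fill-mask probe. Model names and the sentence are assumptions.
from transformers import pipeline

MODELS = {
    "multilingual (XLM-R base)": "xlm-roberta-base",
    "monolingual Spanish (BETO)": "dccuchile/bert-base-spanish-wwm-cased",
}

for label, model_name in MODELS.items():
    fill = pipeline("fill-mask", model=model_name)
    # Each tokenizer defines its own mask token ("[MASK]" vs "<mask>").
    sentence = f"La capital de España es {fill.tokenizer.mask_token}."
    print(label)
    for candidate in fill(sentence, top_k=3):
        print(f"  {candidate['token_str']!r}  score={candidate['score']:.3f}")
```
Comparing the top-ranked completions and their scores gives only an informal impression of each model's handle on Spanish; a real comparison in the spirit of the paper would fine-tune and score both models on annotated Spanish downstream tasks.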
Related papers
- Language Model Knowledge Distillation for Efficient Question Answering in Spanish [16.07396492960869]
We develop SpanishTinyRoBERTa, a compressed language model based on RoBERTa for efficient question answering in Spanish.
We employ knowledge distillation from a large model onto a lighter model, which allows for wider deployment, even in settings with limited computational resources.
Experiments show that the dense distilled model still preserves the performance of its larger counterpart while offering a significant inference speedup (a minimal sketch of the distillation objective appears after this list).
arXiv Detail & Related papers (2023-12-07T10:21:22Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- Sabiá: Portuguese Large Language Models [14.801853435122908]
We show that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora.
Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin.
arXiv Detail & Related papers (2023-04-16T20:11:19Z)
- A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To distinguish models by parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Training dataset and dictionary sizes matter in BERT models: the case of Baltic languages [0.0]
We train a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian.
We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy.
arXiv Detail & Related papers (2021-12-20T14:26:40Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z)
- How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z)
- Evaluating Cross-Lingual Transfer Learning Approaches in Multilingual Conversational Agent Models [1.52292571922932]
We propose a general multilingual model framework for Natural Language Understanding (NLU) models.
We show that these multilingual models can reach the same or better performance than monolingual models across language-specific test data.
arXiv Detail & Related papers (2020-12-07T17:14:52Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction starting from nearly zero training examples, with models improving as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
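As flagged in the SpanishTinyRoBERTa entry above, knowledge distillation trains a small student model to imitate a large teacher. The sketch below shows a generic soft-target distillation loss in PyTorch; it illustrates the general technique, not that paper's exact training recipe, and the temperature, loss weighting and toy tensors are assumptions for illustration.
```python
# Hedged sketch of a generic soft-target knowledge-distillation loss (not the
# SpanishTinyRoBERTa recipe). Hyperparameters and toy tensors are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine a soft KL term (student vs. teacher) with the usual hard-label loss."""
    # Soften both distributions with the temperature, then match them via KL divergence.
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean",
                  log_target=True) * (temperature ** 2)
    # Standard cross-entropy against the gold labels keeps the student grounded.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1.0 - alpha) * ce

# Toy usage with random logits: batch of 4 examples, 10 classes.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```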
This list is automatically generated from the titles and abstracts of the papers on this site.