Spanish Pre-trained BERT Model and Evaluation Data
- URL: http://arxiv.org/abs/2308.02976v1
- Date: Sun, 6 Aug 2023 00:16:04 GMT
- Title: Spanish Pre-trained BERT Model and Evaluation Data
- Authors: José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang and Jorge Pérez
- Abstract summary: We present a BERT-based language model pre-trained exclusively on Spanish data.
We also compiled several tasks specifically for the Spanish language in a single repository.
We have publicly released our model, the pre-training data, and the compilation of the Spanish benchmarks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Spanish language is one of the top 5 spoken languages in the world.
Nevertheless, finding resources to train or evaluate Spanish language models is
not an easy task. In this paper we help bridge this gap by presenting a
BERT-based language model pre-trained exclusively on Spanish data. As a second
contribution, we also compiled several tasks specifically for the Spanish
language in a single repository much in the spirit of the GLUE benchmark. By
fine-tuning our pre-trained Spanish model, we obtain better results compared to
other BERT-based models pre-trained on multilingual corpora for most of the
tasks, even achieving a new state-of-the-art on some of them. We have publicly
released our model, the pre-training data, and the compilation of the Spanish
benchmarks.
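Below is a minimal, hedged sketch of the fine-tuning setup the abstract describes: loading the Spanish BERT checkpoint and taking one training step on a toy classification batch. The Hub identifier dccuchile/bert-base-spanish-wwm-cased, the toy examples, and the learning rate are illustrative assumptions, not details taken from the paper.

    # One fine-tuning step of the Spanish BERT model on a toy classification batch.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "dccuchile/bert-base-spanish-wwm-cased"   # assumed Hub id of the released model
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    # Toy labeled examples standing in for one of the compiled Spanish tasks.
    texts = ["La película fue excelente.", "El servicio fue pésimo."]
    labels = torch.tensor([1, 0])

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed learning rate

    model.train()
    outputs = model(**batch, labels=labels)     # cross-entropy loss computed internally
    outputs.loss.backward()                     # single illustrative gradient step
    optimizer.step()
    optimizer.zero_grad()

In practice the same loop would run over a full benchmark task with proper train/dev splits and multiple epochs.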
Related papers
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual
Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- Lessons learned from the evaluation of Spanish Language Models [27.653133576469276]
We present a head-to-head comparison of language models for Spanish.
We argue that more research is needed to understand the factors underlying the observed results.
The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem.
arXiv Detail & Related papers (2022-12-16T10:33:38Z)
- LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z)
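As a rough illustration of the LERT entry above, the sketch below combines a masked-LM head with one auxiliary token-level linguistic head and sums the two losses. The base checkpoint bert-base-chinese, the 16-tag inventory, the 0.5 loss weight, and the dummy targets are all illustrative assumptions rather than the paper's actual configuration.

    # Rough multi-task pre-training sketch: masked-LM loss plus one auxiliary
    # token-level linguistic-tag loss, summed with an assumed 0.5 weight.
    import torch
    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    base = "bert-base-chinese"                      # assumed starting checkpoint
    tokenizer = AutoTokenizer.from_pretrained(base)
    encoder = AutoModel.from_pretrained(base)
    hidden = encoder.config.hidden_size

    mlm_head = nn.Linear(hidden, encoder.config.vocab_size)  # masked-token prediction
    tag_head = nn.Linear(hidden, 16)                          # hypothetical POS-tag inventory

    batch = tokenizer(["今天天气很好"], return_tensors="pt")
    states = encoder(**batch).last_hidden_state               # (1, seq_len, hidden)

    # Dummy targets: real pre-training derives these from masking and a POS tagger.
    mlm_labels = batch["input_ids"].clone()
    tag_labels = torch.zeros_like(batch["input_ids"])

    ce = nn.CrossEntropyLoss()
    loss = ce(mlm_head(states).transpose(1, 2), mlm_labels) \
        + 0.5 * ce(tag_head(states).transpose(1, 2), tag_labels)
    loss.backward()                                            # one illustrative backward pass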
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
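The sketch below illustrates the probing protocol from the entry above in a hedged way: representations are extracted from a frozen checkpoint and a simple linear probe is fit on top. The released xlm-roberta-base checkpoint stands in for an intermediate pretraining checkpoint, and the singular/plural toy task is a placeholder for the paper's probe suite.

    # Linear-probing sketch: freeze the encoder, mean-pool token states, and fit
    # a simple classifier on a toy "linguistic" label (singular vs. plural).
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    ckpt = "xlm-roberta-base"                 # stands in for an intermediate checkpoint
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    encoder = AutoModel.from_pretrained(ckpt).eval()

    sentences = ["El gato duerme.", "Los gatos duermen.",
                 "La casa es grande.", "Las casas son grandes."]
    labels = [0, 1, 0, 1]                     # toy probe labels, not the paper's tasks

    with torch.no_grad():
        batch = tokenizer(sentences, padding=True, return_tensors="pt")
        states = encoder(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)
        feats = (states * mask).sum(1) / mask.sum(1)   # mean pool over real tokens

    probe = LogisticRegression(max_iter=1000).fit(feats.numpy(), labels)
    print(probe.score(feats.numpy(), labels))          # in-sample probe accuracy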
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
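A hedged sketch of the mBERTu-style setup from the entry above: continued masked-LM pre-training of multilingual BERT on monolingual text. The two Maltese sentences stand in for the corpus, and the learning rate and masking probability are illustrative defaults, not the paper's settings.

    # Continued masked-LM pre-training sketch: start from multilingual BERT and
    # keep training with the MLM objective on monolingual text.
    import torch
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling)

    base = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)

    # Placeholder sentences standing in for a large monolingual Maltese corpus.
    corpus = ["Il-lingwa Maltija hija lingwa Semitika.",
              "Malta hija gżira fil-Mediterran."]
    encodings = [tokenizer(text, truncation=True, max_length=128) for text in corpus]

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    batch = collator(encodings)               # applies random masking and builds labels

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed learning rate
    loss = model(**batch).loss                # MLM loss over the masked positions
    loss.backward()
    optimizer.step()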
- Evaluation Benchmarks for Spanish Sentence Representations [24.162683655834847]
We introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations.
In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations.
arXiv Detail & Related papers (2022-04-15T17:53:05Z)
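The entry above evaluates sentence representations; the sketch below shows a generic SentEval-style protocol in which a frozen encoder is scored by training a light classifier on its embeddings for each task. The random toy encoder and toy sentiment data are placeholders so the snippet is self-contained; they are not part of the benchmark.

    # SentEval-style protocol sketch: score a frozen sentence encoder by training
    # a light classifier on its embeddings for a downstream task.
    from typing import Callable, List
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def evaluate_task(encode: Callable[[List[str]], np.ndarray],
                      texts: List[str], labels: List[int]) -> float:
        """Fit a linear classifier on frozen embeddings and return test accuracy."""
        features = encode(texts)
        x_tr, x_te, y_tr, y_te = train_test_split(features, labels,
                                                  test_size=0.5, random_state=0)
        classifier = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        return classifier.score(x_te, y_te)

    # Placeholder encoder (random projections) so the sketch runs without a model;
    # a real run would plug in a pre-trained Spanish sentence encoder here.
    rng = np.random.default_rng(0)
    def toy_encode(texts: List[str]) -> np.ndarray:
        return rng.normal(size=(len(texts), 32))

    texts = ["Qué buena película.", "No me gustó nada."] * 10
    labels = [1, 0] * 10
    print(evaluate_task(toy_encode, texts, labels))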
- Fake News Detection in Spanish Using Deep Learning Techniques [0.0]
This paper addresses the problem of fake news detection in Spanish using Machine Learning techniques.
It is fundamentally the same problem tackled for the English language.
There is not a significant amount of publicly available and adequately labeled fake news in Spanish to effectively train a Machine Learning model.
arXiv Detail & Related papers (2021-10-13T02:56:16Z)
- The futility of STILTs for the classification of lexical borrowings in Spanish [0.0]
STILTs do not provide any improvement over direct fine-tuning of multilingual models.
Multilingual models trained on small subsets of languages perform reasonably better than multilingual BERT, but not as well as multilingual RoBERTa on the given dataset.
arXiv Detail & Related papers (2021-09-17T15:32:02Z)
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
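A hedged sketch of the few-shot, in-context setup described in the entry above: English demonstrations are placed in the prompt and the model is asked to label a Spanish input. The gpt2 checkpoint is only a small stand-in; the paper evaluates larger GPT and T5 models, and the prompt format here is an assumption.

    # Few-shot, in-context classification sketch: English demonstrations in the
    # prompt, Spanish test input, greedy decoding of the predicted label.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"                               # small stand-in checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    prompt = (
        "Review: The movie was fantastic. Sentiment: positive\n"
        "Review: I hated every minute of it. Sentiment: negative\n"
        "Review: La película fue maravillosa. Sentiment:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=2, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
    prediction = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    print(prediction.strip())                   # the model's guessed label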
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
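A minimal, hedged sketch of the multilingual finetuning idea from the entry above, assuming the pretrained checkpoint is available on the Hugging Face Hub as facebook/mbart-large-50 and that a recent transformers version (with the text_target tokenizer argument) is installed. The real recipe finetunes on many language directions at once; this shows a single English to Spanish update.

    # Multilingual finetuning sketch: one teacher-forced update of a pretrained
    # multilingual translation model on a single English-Spanish pair.
    import torch
    from transformers import AutoModelForSeq2SeqLM, MBart50TokenizerFast

    name = "facebook/mbart-large-50"            # assumed Hub id of the pretrained model
    tokenizer = MBart50TokenizerFast.from_pretrained(name, src_lang="en_XX", tgt_lang="es_XX")
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    batch = tokenizer("The weather is nice today.",
                      text_target="El clima está agradable hoy.",
                      return_tensors="pt")      # builds input_ids and target labels

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # assumed learning rate
    loss = model(**batch).loss                  # cross-entropy on the Spanish target side
    loss.backward()
    optimizer.step()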
This list is automatically generated from the titles and abstracts of the papers on this site.