Give your Text Representation Models some Love: the Case for Basque
- URL: http://arxiv.org/abs/2004.00033v2
- Date: Thu, 2 Apr 2020 11:46:52 GMT
- Title: Give your Text Representation Models some Love: the Case for Basque
- Authors: Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena,
Xabier Saralegi, Aitor Soroa, Eneko Agirre
- Abstract summary: Word embeddings and pre-trained language models make it possible to build rich representations of text.
Many small companies and research groups tend to use models that have been pre-trained and made available by third parties.
This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora.
We show that a number of monolingual models trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks.
- Score: 24.76979832867631
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word embeddings and pre-trained language models make it possible to build
rich representations of text and have enabled improvements across most NLP tasks.
Unfortunately they are very expensive to train, and many small companies and
research groups tend to use models that have been pre-trained and made
available by third parties, rather than building their own. This is suboptimal
as, for many languages, the models have been trained on smaller (or lower
quality) corpora. In addition, monolingual pre-trained models for non-English
languages are not always available. At best, models for those languages are
included in multilingual versions, where each language shares the quota of
substrings and parameters with the rest of the languages. This is particularly
true for smaller languages such as Basque. In this paper we show that a number
of monolingual models (FastText word embeddings, FLAIR and BERT language
models) trained with larger Basque corpora produce much better results than
publicly available versions in downstream NLP tasks, including topic
classification, sentiment classification, PoS tagging and NER. This work sets a
new state-of-the-art in those tasks for Basque. All benchmarks and models used
in this work are publicly available.
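As a rough illustration of how such a monolingual model can be put to work on a downstream task, the sketch below loads a Basque BERT checkpoint with the Hugging Face transformers library and attaches a token-classification head of the kind one would fine-tune for NER. The model identifier and the label count are assumptions made for this example, not details taken from the paper.
```python
# Minimal sketch, assuming a Basque BERT checkpoint is available on the
# Hugging Face hub (the model id below is an assumption; substitute your own).
# It loads the encoder with an untrained token-classification head, as one
# would before fine-tuning it on an annotated Basque NER corpus.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "ixa-ehu/berteus-base-cased"   # assumed hub id for a Basque BERT
num_labels = 9                            # illustrative size of a BIO tag set

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=num_labels)

sentence = "Donostian euria ari du gaur."  # "It is raining in Donostia today."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)           # per-token label ids (head is still untrained)
print(pred_ids)
```
Fine-tuning on a labeled Basque corpus would then proceed as with any standard transformers token-classification setup.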
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
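As a hedged illustration of what such a prompt might look like for a topic-classification problem, the snippet below builds a zero-shot instruction; the labels and wording are invented for this example, not taken from the paper.
```python
# Hypothetical zero-shot classification prompt; labels and phrasing are
# illustrative only. The resulting string would be sent to an
# instruction-following language model.
labels = ["politics", "sports", "economy", "culture"]
document = "The regional government approved the new budget on Tuesday."

prompt = (
    "Classify the following news article into exactly one of these topics: "
    + ", ".join(labels) + ".\n\n"
    f"Article: {document}\n"
    "Topic:"
)
print(prompt)
```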
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z) - WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models [3.6878069324996616]
We introduce a method -- called WECHSEL -- to transfer English models to new languages.
We use WECHSEL to transfer GPT-2 and RoBERTa models to 4 other languages.
arXiv Detail & Related papers (2021-12-13T12:26:02Z) - Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed method consists of replacing the shared vocabulary with a small language-specific vocabulary and fine-tuning the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
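A minimal PyTorch sketch of that idea follows, with a toy encoder standing in for the translation model (sizes and architecture are placeholders, not the paper's setup): the original parameters are frozen and only a new language-specific embedding table is trained.
```python
# Illustrative sketch only: freeze an existing model's parameters and train
# just a newly added language-specific embedding table, so the original
# languages are untouched. The toy encoder stands in for a real NMT model.
import torch
import torch.nn as nn

d_model = 512

class ToyEncoder(nn.Module):
    def __init__(self, shared_vocab_size=32000):
        super().__init__()
        self.embed = nn.Embedding(shared_vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_embeddings):
        return self.encoder(token_embeddings)

model = ToyEncoder()

# 1) Freeze every original parameter: performance on the initial languages
#    cannot degrade because these weights never change.
for p in model.parameters():
    p.requires_grad = False

# 2) Add a small language-specific vocabulary/embedding for the new language.
new_lang_embed = nn.Embedding(8000, d_model)

# 3) Only the new embeddings are optimized, on the new language's parallel data.
optimizer = torch.optim.Adam(new_lang_embed.parameters(), lr=1e-4)

tokens = torch.randint(0, 8000, (2, 16))      # dummy batch of new-language token ids
hidden = model(new_lang_embed(tokens))        # frozen encoder, trainable embeddings
```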
arXiv Detail & Related papers (2021-10-20T10:38:57Z) - Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan [0.05277024349608833]
This work focuses on Catalan with the aim of exploring to what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
arXiv Detail & Related papers (2021-07-16T13:52:01Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning a pretrained model on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z) - WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z) - ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It shows its state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including existing ones as well as newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z) - Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation [73.65237422910738]
We present an easy and efficient method to extend existing sentence embedding models to new languages.
This makes it possible to create multilingual versions of previously monolingual models.
arXiv Detail & Related papers (2020-04-21T08:20:25Z)
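A minimal sketch of that distillation objective, under simplifying assumptions: the encoders below are placeholders (in practice the teacher is a fixed monolingual sentence-embedding model and the student a multilingual transformer), and on a parallel pair the student is trained to map both the source sentence and its translation onto the teacher's embedding of the source.
```python
# Minimal sketch of multilingual knowledge distillation for sentence
# embeddings. Given a parallel pair (s, t), the student learns to embed both
# s and t close to the frozen teacher's embedding of s. The EmbeddingBag
# encoders are stand-ins, not real sentence-embedding models.
import torch
import torch.nn as nn

dim = 256
teacher = nn.EmbeddingBag(10000, dim)   # stands in for a frozen monolingual teacher
student = nn.EmbeddingBag(10000, dim)   # stands in for a trainable multilingual student
for p in teacher.parameters():
    p.requires_grad = False

mse = nn.MSELoss()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

# Dummy token-id batches for source sentences and their translations.
src_ids = torch.randint(0, 10000, (4, 12))
tgt_ids = torch.randint(0, 10000, (4, 12))

with torch.no_grad():
    target = teacher(src_ids)               # teacher embedding of the source

loss = mse(student(src_ids), target) + mse(student(tgt_ids), target)
loss.backward()
optimizer.step()
```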
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.