Are Multilingual Models the Best Choice for Moderately Under-resourced
Languages? A Comprehensive Assessment for Catalan
- URL: http://arxiv.org/abs/2107.07903v1
- Date: Fri, 16 Jul 2021 13:52:01 GMT
- Title: Are Multilingual Models the Best Choice for Moderately Under-resourced
Languages? A Comprehensive Assessment for Catalan
- Authors: Jordi Armengol-Estap\'e, Casimiro Pio Carrino, Carlos
Rodriguez-Penagos, Ona de Gibert Bonet, Carme Armentano-Oller, Aitor
Gonzalez-Agirre, Maite Melero and Marta Villegas
- Abstract summary: This work focuses on Catalan with the aim of exploring what extent a medium-sized monolingual language model is competitive with state-of-the-art large multilingual models.
We build a clean, high-quality textual Catalan corpus (CaText), train a Transformer-based language model for Catalan (BERTa), and devise a thorough evaluation in a diversity of settings.
The result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB), which we publish as an open resource.
- Score: 0.05277024349608833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual language models have been a crucial breakthrough as they
considerably reduce the need of data for under-resourced languages.
Nevertheless, the superiority of language-specific models has already been
proven for languages having access to large amounts of data. In this work, we
focus on Catalan with the aim to explore to what extent a medium-sized
monolingual language model is competitive with state-of-the-art large
multilingual models. For this, we: (1) build a clean, high-quality textual
Catalan corpus (CaText), the largest to date (but only a fraction of the usual
size of the previous work in monolingual language models), (2) train a
Transformer-based language model for Catalan (BERTa), and (3) devise a thorough
evaluation in a diversity of settings, comprising a complete array of
downstream tasks, namely, Part of Speech Tagging, Named Entity Recognition and
Classification, Text Classification, Question Answering, and Semantic Textual
Similarity, with most of the corresponding datasets being created ex novo. The
result is a new benchmark, the Catalan Language Understanding Benchmark (CLUB),
which we publish as an open resource, together with the clean textual corpus,
the language model, and the cleaning pipeline. Using state-of-the-art
multilingual models and a monolingual model trained only on Wikipedia as
baselines, we consistently observe the superiority of our model across tasks
and settings.
Related papers
- ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot
Multilingual Information Retrieval [10.664434993386523]
Current approaches circumvent the lack of high-quality labeled data in non-English languages.
We present a novel modular dense retrieval model that learns from the rich data of a single high-resource language.
arXiv Detail & Related papers (2024-02-23T02:21:24Z) - The Less the Merrier? Investigating Language Representation in
Multilingual Models [8.632506864465501]
We investigate the linguistic representation of different languages in multilingual models.
We observe from our experiments that community-centered models perform better at distinguishing between languages in the same family for low-resource languages.
arXiv Detail & Related papers (2023-10-20T02:26:34Z) - Tik-to-Tok: Translating Language Models One Token at a Time: An
Embedding Initialization Strategy for Efficient Language Adaptation [19.624330093598996]
Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data.
By generalizing over a word translation dictionary encompassing both the source and target languages, we map tokens from the target tokenizer to semantically similar tokens from the source language tokenizer.
We conduct experiments to convert high-resource models to mid- and low-resource languages, namely Dutch and Frisian.
arXiv Detail & Related papers (2023-10-05T11:45:29Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks on evaluation code generation models: MBXP and Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - How Good is Your Tokenizer? On the Monolingual Performance of
Multilingual Language Models [96.32118305166412]
We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks.
We find that languages which are adequately represented in the multilingual model's vocabulary exhibit negligible performance decreases over their monolingual counterparts.
arXiv Detail & Related papers (2020-12-31T14:11:00Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - When Being Unseen from mBERT is just the Beginning: Handling New
Languages With Multilingual Language Models [2.457872341625575]
Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP.
We show that such models behave in multiple ways on unseen languages.
arXiv Detail & Related papers (2020-10-24T10:15:03Z) - Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance at either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z) - Structure-Level Knowledge Distillation For Multilingual Sequence
Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models to the unified multilingual model (student)
Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.