Assessing Linguistic Generalisation in Language Models: A Dataset for
Brazilian Portuguese
- URL: http://arxiv.org/abs/2305.14070v2
- Date: Wed, 7 Jun 2023 08:53:08 GMT
- Title: Assessing Linguistic Generalisation in Language Models: A Dataset for
Brazilian Portuguese
- Authors: Rodrigo Wilkens, Leonardo Zilio and Aline Villavicencio
- Abstract summary: We propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese.
These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions.
- Score: 4.941630596191806
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Much recent effort has been devoted to creating large-scale language models.
Nowadays, the most prominent approaches are based on deep neural networks, such
as BERT. However, they lack transparency and interpretability, and are often
seen as black boxes. This affects not only their applicability in downstream
tasks but also the comparability of different architectures or even of the same
model trained using different corpora or hyperparameters. In this paper, we
propose a set of intrinsic evaluation tasks that inspect the linguistic
information encoded in models developed for Brazilian Portuguese. These tasks
are designed to evaluate how different language models generalise information
related to grammatical structures and multiword expressions (MWEs), thus
allowing for an assessment of whether the model has learned different
linguistic phenomena. The dataset that was developed for these tasks is
composed of a series of sentences with a single masked word and a cue phrase
that helps in narrowing down the context. This dataset is divided into MWEs and
grammatical structures, and the latter is subdivided into 6 tasks: impersonal
verbs, subject agreement, verb agreement, nominal agreement, passive and
connectors. The subset for MWEs was used to test BERTimbau Large, BERTimbau
Base and mBERT. For the grammatical structures, we used only BERTimbau Large,
because it yielded the best results in the MWE task.
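The abstract describes each test item as a sentence with a single masked word plus a cue phrase. Below is a minimal sketch of how such items might be scored, assuming a top-k hit criterion; the abstract does not state the paper's actual metric. The names `MaskedItem`, `hit_at_k` and `accuracy_at_k`, the example sentences, and the ranked predictions are all hypothetical and are not taken from the dataset or from any model output.

```python
from dataclasses import dataclass


@dataclass
class MaskedItem:
    """One test item: a cue phrase, a sentence with one masked word, accepted answers."""
    cue: str
    sentence: str        # assumed to contain exactly one "[MASK]" token
    accepted: frozenset  # lower-cased words counted as correct (assumption)


def hit_at_k(item, ranked_predictions, k=10):
    """Return True if any of the top-k predicted words is an accepted answer."""
    top_k = (word.lower() for word in ranked_predictions[:k])
    return any(word in item.accepted for word in top_k)


def accuracy_at_k(items, predictions, k=10):
    """Fraction of items with at least one accepted answer among the top-k predictions."""
    if not items:
        return 0.0
    hits = sum(hit_at_k(item, preds, k) for item, preds in zip(items, predictions))
    return hits / len(items)


# Invented example items, for illustration only.
items = [
    MaskedItem(
        cue="Ontem choveu muito.",
        sentence="Hoje [MASK] de novo.",
        accepted=frozenset({"choveu", "chove"}),
    ),
    MaskedItem(
        cue="Ela gosta de livros.",
        sentence="Os livros [MASK] caros.",
        accepted=frozenset({"são", "eram", "estão"}),
    ),
]

# Hypothetical ranked candidates, one list per item, best candidate first.
predictions = [
    ["chove", "está", "foi"],
    ["é", "custa", "são"],
]

print(accuracy_at_k(items, predictions, k=1))
print(accuracy_at_k(items, predictions, k=3))
```

In practice the ranked candidates would come from a masked language model such as the ones named in the abstract. They are hard-coded here so the sketch runs without downloading a model.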
Related papers
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may potentially serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models [7.386862225828819]
This work presents a novel dataset of naturally occurring sentences containing MWEs manually classified into a fine-grained set of meanings.
We use this dataset in two tasks designed to test i) a language model's ability to detect idiom usage, and ii) the effectiveness of a language model in generating representations of sentences containing idioms.
arXiv Detail & Related papers (2021-09-09T16:53:17Z)
- Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval [15.902630454568811]
We propose a novel Mixed Attention Transformer (MAT) that incorporates external word level knowledge, such as a dictionary or translation table.
By encoding the translation knowledge into an attention matrix, the model with MAT is able to focus on the mutually translated words in the input sequence.
arXiv Detail & Related papers (2021-09-07T00:33:14Z)
- WangchanBERTa: Pretraining transformer-based Thai Language Models [2.186960190193067]
We pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size).
We apply text processing rules that are specific to Thai, most importantly preserving spaces.
We also experiment with word-level, syllable-level and SentencePiece tokenization with a smaller dataset to explore the effects of tokenization on downstream performance.
arXiv Detail & Related papers (2021-01-24T03:06:34Z)
- Mono vs Multilingual Transformer-based Models: a Comparison across Several Language Tasks [1.2691047660244335]
BERT (Bidirectional Encoder Representations from Transformers) and ALBERT (A Lite BERT) are methods for pre-training language models.
We make available our trained BERT and ALBERT models for Portuguese.
arXiv Detail & Related papers (2020-07-19T19:13:20Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.