Evaluating Contextualized Language Models for Hungarian
- URL: http://arxiv.org/abs/2102.10848v1
- Date: Mon, 22 Feb 2021 09:29:01 GMT
- Title: Evaluating Contextualized Language Models for Hungarian
- Authors: Judit Ács and Dániel Lévai and Dávid Márk Nemeskey and András Kornai
- Abstract summary: We compare huBERT, a Hungarian model, against four multilingual models, including multilingual BERT.
We find that huBERT works better than the other models, often by a large margin, particularly near the global optimum.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an extended comparison of contextualized language models for
Hungarian. We compare huBERT, a Hungarian model, against four multilingual
models, including multilingual BERT. We evaluate these models on three
tasks: morphological probing, POS tagging, and NER. We find that huBERT works
better than the other models, often by a large margin, particularly near the
global optimum (typically at the middle layers). We also find that huBERT tends
to generate fewer subwords per word, and that using the last subword for
token-level tasks is generally a better choice than using the first one.
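The two tokenizer-related findings, that huBERT produces fewer subwords per word and that the last subword is the better choice for token-level tasks, are straightforward to examine in code. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint names SZTAKI-HLT/hubert-base-cc (huBERT) and bert-base-multilingual-cased (mBERT) and the example words are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (assumed checkpoints and example words, see note above):
# compare subword fertility of huBERT vs. mBERT and build one vector per
# word by keeping the LAST subword of each word.
import torch
from transformers import AutoModel, AutoTokenizer

words = ["Budapesten", "sétáltunk", "tegnap", "este"]  # hypothetical Hungarian example

# 1) Subword fertility: average number of subwords produced per word.
for name in ["SZTAKI-HLT/hubert-base-cc", "bert-base-multilingual-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = [tok.tokenize(w) for w in words]
    print(name, "subwords/word =", sum(len(p) for p in pieces) / len(words))

# 2) Last-subword pooling for token-level tasks (needs a fast tokenizer for word_ids()).
name = "SZTAKI-HLT/hubert-base-cc"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
enc = tok(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]   # (num_subwords, hidden_size)

last_position = {}                               # word index -> index of its last subword
for pos, wid in enumerate(enc.word_ids(0)):      # None marks special tokens like [CLS]/[SEP]
    if wid is not None:
        last_position[wid] = pos

word_vectors = torch.stack([hidden[last_position[i]] for i in range(len(words))])
print(word_vectors.shape)                        # one contextual vector per input word
```

The same pooling can be repeated per layer (by requesting output_hidden_states=True) to trace the layer-wise curve behind the observation that the best-performing representations sit around the middle layers.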
Related papers
- False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models [53.01170039144264]
Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages.
Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages?
We find that models with overlap outperform models with disjoint vocabularies.
arXiv Detail & Related papers (2025-09-23T07:47:54Z)
- mmBERT: A Modern Multilingual Encoder with Annealed Language Learning [57.58071656545661]
mmBERT is an encoder-only language model pretrained on 3T tokens of multilingual text.
We add over 1700 low-resource languages to the data mix only during the decay phase.
We show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks.
arXiv Detail & Related papers (2025-09-08T17:08:42Z)
- Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do.
We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models.
Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z)
- On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z)
- Evaluating Large Language Models on Controlled Generation Tasks [92.64781370921486]
We present an extensive analysis of various benchmarks including a sentence planning benchmark with different granularities.
After comparing large language models against state-of-the-art finetuned smaller models, we present a spectrum showing where large language models fall behind, are comparable to, or exceed the ability of smaller models.
arXiv Detail & Related papers (2023-10-23T03:48:24Z)
- ur-iw-hnt at GermEval 2021: An Ensembling Strategy with Multiple BERT Models [5.952826555378035]
We submitted three runs using an ensembling strategy by majority (hard) voting with multiple different BERT models.
All ensemble models outperform single models, while BERTweet is the best individual model in every subtask.
Twitter-based models perform better than GermanBERT models, and multilingual models perform worse but by a small margin.
arXiv Detail & Related papers (2021-10-05T13:48:20Z)
- gaBERT -- an Irish Language Model [7.834915319072005]
gaBERT is a monolingual BERT model for the Irish language.
We show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance.
arXiv Detail & Related papers (2021-07-27T16:38:53Z)
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- GottBERT: a pure German Language Model [0.0]
No single-language German RoBERTa model has been published yet; we introduce one in this work (GottBERT).
In an evaluation, we compare its performance on the two Named Entity Recognition (NER) tasks CoNLL 2003 and GermEval 2014, as well as on the text classification tasks GermEval 2018 (fine and coarse) and GNAD, with existing German single-language BERT models and two multilingual ones.
GottBERT was successfully pre-trained on a 256 core TPU pod using the RoBERTa BASE architecture.
arXiv Detail & Related papers (2020-12-03T17:45:03Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- KR-BERT: A Small-Scale Korean-Specific Language Model [0.0]
We trained a Korean-specific model KR-BERT, utilizing a smaller vocabulary and dataset.
Our model performed comparably to, and sometimes even better than, other existing pre-trained models, using a corpus about 1/10 the size.
arXiv Detail & Related papers (2020-08-10T09:26:00Z)
- WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
- Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models (teachers) into the unified multilingual model (student).
Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.