GREEK-BERT: The Greeks visiting Sesame Street
- URL: http://arxiv.org/abs/2008.12014v2
- Date: Thu, 3 Sep 2020 08:41:35 GMT
- Title: GREEK-BERT: The Greeks visiting Sesame Street
- Authors: John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis and Ion
Androutsopoulos
- Abstract summary: Transformer-based language models, such as BERT, have achieved state-of-the-art performance in several downstream natural language processing tasks.
We present GREEK-BERT, a monolingual BERT-based language model for modern Greek.
- Score: 25.406207104603027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based language models, such as BERT and its variants, have
achieved state-of-the-art performance in several downstream natural language
processing (NLP) tasks on generic benchmark datasets (e.g., GLUE, SQUAD, RACE).
However, these models have mostly been applied to the resource-rich English
language. In this paper, we present GREEK-BERT, a monolingual BERT-based
language model for modern Greek. We evaluate its performance in three NLP
tasks, i.e., part-of-speech tagging, named entity recognition, and natural
language inference, obtaining state-of-the-art performance. Interestingly, in
two of the benchmarks GREEK-BERT outperforms two multilingual Transformer-based
models (M-BERT, XLM-R), as well as shallower neural baselines operating on
pre-trained word embeddings, by a large margin (5%-10%). Most importantly, we
make both GREEK-BERT and our training code publicly available, along with code
illustrating how GREEK-BERT can be fine-tuned for downstream NLP tasks. We
expect these resources to boost NLP research and applications for modern Greek.
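The released checkpoint is meant to be used through standard Transformer tooling. The sketch below shows, under stated assumptions, how GREEK-BERT might be loaded and fitted with a classification head for one of the evaluated tasks (natural language inference); the Hugging Face model identifier nlpaueb/bert-base-greek-uncased-v1, the three-way label scheme, and the Greek example sentences are illustrative assumptions, not details taken from the paper.
```python
# Minimal sketch: loading GREEK-BERT and attaching an NLI-style classification head.
# Assumptions (not from the paper): the Hugging Face model id
# "nlpaueb/bert-base-greek-uncased-v1", a 3-way NLI label set, and the example sentences.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "nlpaueb/bert-base-greek-uncased-v1"  # assumed public model identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# The sequence-classification head is randomly initialised on top of the pretrained
# encoder and must be fine-tuned (e.g. on a Greek NLI dataset) before use.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# BERT-style NLI: encode the premise-hypothesis pair as one sequence.
premise = "Ο σκύλος τρέχει στο πάρκο."              # "The dog is running in the park."
hypothesis = "Ένα ζώο βρίσκεται σε εξωτερικό χώρο."  # "An animal is outdoors."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits                  # shape: (1, 3)
print(logits.softmax(dim=-1))  # probabilities are meaningless until the head is trained
```
In practice, the randomly initialised head would first be fine-tuned on a Greek NLI dataset (for example, the Greek portion of XNLI) before its predictions carry any meaning.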
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks [0.0]
We provide a Transformer-based model and a baseline benchmark for the Turkish language.
We successfully fine-tuned a Turkish BERT model, namely BERTurk, on many downstream tasks and evaluated it on a Turkish benchmark dataset.
arXiv Detail & Related papers (2024-01-30T19:27:04Z)
- OYXOY: A Modern NLP Test Suite for Modern Greek [2.059776592203642]
This paper serves as a foundational step towards the development of a linguistically motivated evaluation suite for Greek NLP.
We introduce four expert-verified evaluation tasks, specifically targeted at natural language inference, word sense disambiguation and metaphor detection.
Rather than mere language-adapted replicas of existing tasks, we contribute two innovations that will resonate with the broader resource and evaluation community.
arXiv Detail & Related papers (2023-09-13T15:00:56Z)
- GreekBART: The First Pretrained Greek Sequence-to-Sequence Model [13.429669368275318]
We introduce GreekBART, the first Seq2Seq model based on the BART-base architecture and pretrained on a large-scale Greek corpus.
We evaluate and compare GreekBART against BART-random, Greek-BERT, and XLM-R on a variety of discriminative tasks.
arXiv Detail & Related papers (2023-04-03T10:48:51Z)
- KinyaBERT: a Morphology-aware Kinyarwanda Language Model [1.2183405753834562]
Unsupervised sub-word tokenization methods are sub-optimal at handling morphologically rich languages.
We propose a simple yet effective two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality.
We evaluate our proposed method on the low-resource morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT.
arXiv Detail & Related papers (2022-03-16T08:36:14Z)
- A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- GLGE: A New General Language Generation Evaluation Benchmark [139.25515221280767]
General Language Generation Evaluation (GLGE) is a new multi-task benchmark for evaluating the generalization capabilities of NLG models.
To encourage research on pretraining and transfer learning on NLG models, we make GLGE publicly available and build a leaderboard with strong baselines.
arXiv Detail & Related papers (2020-11-24T06:59:45Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLM-R (large) model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It achieves state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including existing ones as well as newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)