AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding
- URL: http://arxiv.org/abs/2012.15516v2
- Date: Sun, 7 Mar 2021 13:23:41 GMT
- Title: AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding
- Authors: Wissam Antoun, Fady Baly, Hazem Hajj
- Abstract summary: We develop an Arabic language representation model, which we name AraELECTRA.
Our model is pretrained using the replaced token detection objective on large Arabic text corpora.
We show that AraELECTRA outperforms current state-of-the-art Arabic language representation models, given the same pretraining data and even with a smaller model size.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in English language representation enabled a more
sample-efficient pre-training task, Efficiently Learning an Encoder that
Classifies Token Replacements Accurately (ELECTRA), which, instead of
training a model to recover masked tokens, trains a discriminator model to
distinguish true input tokens from corrupted tokens that were replaced by a
generator network.
On the other hand, current Arabic language representation approaches rely only
on pretraining via masked language modeling. In this paper, we develop an
Arabic language representation model, which we name AraELECTRA. Our model is
pretrained using the replaced token detection objective on large Arabic text
corpora. We evaluate our model on multiple Arabic NLP tasks, including reading
comprehension, sentiment analysis, and named-entity recognition and we show
that AraELECTRA outperforms current state-of-the-art Arabic language
representation models, given the same pretraining data and even with a smaller
model size.
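The following is a minimal sketch of what replaced token detection looks like at inference time: an ELECTRA-style discriminator scores every token of a sentence and flags the ones it believes were substituted by the generator. It assumes the Hugging Face transformers library and the publicly released aubmindlab/araelectra-base-discriminator checkpoint; the example sentence and its corrupted token are illustrative choices, not taken from the paper.

```python
# Minimal sketch of replaced token detection with an ELECTRA-style
# discriminator. The checkpoint id below is an assumption; adjust it to
# whichever AraELECTRA discriminator checkpoint you actually use.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

MODEL_ID = "aubmindlab/araelectra-base-discriminator"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
discriminator = ElectraForPreTraining.from_pretrained(MODEL_ID)
discriminator.eval()

# "The capital of Lebanon is Beirut", with the last word corrupted to "Paris".
corrupted = "عاصمة لبنان هي باريس"

inputs = tokenizer(corrupted, return_tensors="pt")
with torch.no_grad():
    # One logit per token: positive values mean "predicted as replaced".
    logits = discriminator(**inputs).logits[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, score in zip(tokens, logits.tolist()):
    label = "replaced?" if score > 0 else "original"
    print(f"{token}\t{label}")
```

During pre-training, a small generator network produces the corrupted tokens and the discriminator is trained with a per-token binary cross-entropy loss over exactly this kind of prediction; because every input position contributes to the loss, the objective is more sample-efficient than masked language modeling, which learns only from the masked positions.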
Related papers
- Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models [4.165536532090932]
The disconnect between tokenizer creation and model training in language models allows for specific inputs, such as the infamous SolidGoldMagikarp token, to induce unwanted model behaviour.
We present a comprehensive analysis of Large Language Model tokenizers, specifically targeting this issue of detecting under-trained tokens.
Through a combination of tokenizer analysis, model weight-based indicators, and prompting techniques, we develop novel and effective methods for automatically detecting these problematic tokens.
arXiv Detail & Related papers (2024-05-08T20:37:56Z)
- Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space [2.9914612342004503]
We train a bilingual Arabic-Hebrew language model using a transliterated version of Arabic texts in Hebrew.
We assess the performance of a language model that employs a unified script for both languages, on machine translation.
arXiv Detail & Related papers (2024-02-25T11:26:39Z)
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data is by far the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization [10.342180619706724]
We finetune token-free pre-trained multilingual models to learn to predict and insert missing diacritics in Arabic text.
We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering.
arXiv Detail & Related papers (2023-03-25T23:41:33Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its strong language understanding capability, outperforms various strong pre-trained language models.
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence Labeling [7.310390479801139]
We self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce language varieties.
Our work opens up opportunities for developing dialectal Arabic (DA) models that exploit only Modern Standard Arabic (MSA) resources.
arXiv Detail & Related papers (2021-01-12T21:29:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.