Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text
Diacritization
- URL: http://arxiv.org/abs/2303.14588v1
- Date: Sat, 25 Mar 2023 23:41:33 GMT
- Title: Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text
Diacritization
- Authors: Bashar Al-Rfooh, Gheith Abandah, Rami Al-Rfou
- Abstract summary: We finetune token-free pre-trained multilingual models to learn to predict and insert missing diacritics in Arabic text.
We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering.
- Score: 10.342180619706724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most previous work on learning diacritization of the Arabic language
relied on training models from scratch. In this paper, we investigate how to
leverage pre-trained language models to learn diacritization. We finetune
token-free pre-trained multilingual models (ByT5) to learn to predict and
insert missing diacritics in Arabic text, a complex task that requires
understanding the sentence semantics and the morphological structure of the
tokens. We show that we can achieve state-of-the-art results on the diacritization
task with a minimal amount of training and no feature engineering, reducing WER by
40%. We release our finetuned models for the benefit of the research community.
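The paper frames diacritization as byte-level sequence-to-sequence restoration: the model reads undiacritized Arabic and generates the same text with diacritics inserted. Below is a minimal sketch of how such finetuning could look with Hugging Face Transformers and the public google/byt5-small checkpoint; the learning rate, single toy sentence pair, and training loop are illustrative assumptions, not the authors' exact setup.

```python
# Minimal finetuning sketch: restore diacritics with a byte-level seq2seq model.
# Checkpoint, hyperparameters, and the toy pair are assumptions for illustration.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_name = "google/byt5-small"  # token-free: operates directly on UTF-8 bytes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# One illustrative training pair: bare input -> fully diacritized target.
source = "ذهب الولد إلى المدرسة"
target = "ذَهَبَ الوَلَدُ إِلَى الْمَدْرَسَةِ"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
loss.backward()
optimizer.step()

# Inference: generate the diacritized form for the input text.
model.eval()
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))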
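```

In practice this step would be repeated over a large diacritized corpus, with quality reported as diacritic or word error rate (DER/WER), the metric cited in the abstract.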
Related papers
- Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic [9.004920233490642]
We show that multilingual-BERT (mBERT), incrementally pretrained on Arabic monolingual data, requires less training time and yields accuracy comparable to our custom monolingual Arabic model.
We then explore two continual pre-training methods: (1) continual finetuning on small amounts of dialectal data, and (2) training on parallel Arabic-English data with a Translation Language Modeling loss function.
arXiv Detail & Related papers (2022-11-08T02:51:57Z)
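A minimal sketch of the continual masked-language-model pretraining step described in the entry above, using Hugging Face Transformers and the public bert-base-multilingual-cased checkpoint; the toy sentences, masking rate, and optimizer settings are illustrative assumptions, not the paper's configuration.

```python
# Continual MLM pretraining sketch: keep training mBERT on Arabic text with the
# standard masked-language-model objective (toy batch; real corpora are large).
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

texts = ["الطقس اليوم جميل جدا", "شو أخبارك اليوم؟"]  # illustrative MSA + dialect
batch = tokenizer(texts, padding=True, return_tensors="pt")

# Randomly mask 15% of the tokens, as in standard BERT pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
masked = collator([{"input_ids": ids} for ids in batch["input_ids"]])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(input_ids=masked["input_ids"],
             attention_mask=batch["attention_mask"],
             labels=masked["labels"]).loss
loss.backward()
optimizer.step()
```

The second method in the entry, Translation Language Modeling, would instead concatenate an Arabic sentence with its English translation and apply the same masking over the joint sequence.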
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Impact of Tokenization on Language Models: An Analysis for Turkish [2.4660652494309936]
We train tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus.
Our experiments, supported by statistical tests, reveal that the morphological-level tokenizer performs competitively with the de facto tokenizers.
We find that increasing the vocabulary size improves the performance of morphological- and word-level tokenizers more than that of the de facto tokenizers.
arXiv Detail & Related papers (2022-04-19T12:01:46Z)
- Supporting Undotted Arabic with Pre-trained Language Models [0.0]
We study the effect of applying pre-trained Arabic language models on "undotted" Arabic texts.
We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing tasks.
arXiv Detail & Related papers (2021-11-18T16:47:56Z)
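To illustrate what "undotted" text means in the entry above, here is a small dot-stripping sketch; the mapping is partial and the paper's exact undotting convention may differ.

```python
# Illustrative, partial mapping from dotted Arabic letters to their undotted
# skeleton (rasm) forms; the paper's exact convention may differ.
UNDOT = {
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ں", "ي": "ى",
    "ج": "ح", "خ": "ح", "ذ": "د", "ز": "ر", "ش": "س",
    "ض": "ص", "ظ": "ط", "غ": "ع", "ف": "ڡ", "ق": "ٯ", "ة": "ه",
}

def undot(text: str) -> str:
    """Replace each dotted letter with its undotted skeleton form."""
    return "".join(UNDOT.get(ch, ch) for ch in text)

print(undot("جميل جدا"))  # the same words with their distinguishing dots removed
```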
- Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z)
- Filling the Gaps in Ancient Akkadian Texts: A Masked Language Modelling Approach [8.00388161728995]
We present models which complete missing text given transliterations of ancient Mesopotamian documents.
Due to the tablets' deterioration, scholars often rely on contextual cues to manually fill in missing parts in the text.
arXiv Detail & Related papers (2021-09-09T18:58:14Z)
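The completion task in the entry above follows the standard fill-mask pattern. The sketch below uses a multilingual BERT checkpoint and an English placeholder sentence as stand-ins, since the paper's Akkadian-specific models and transliterated data are not shown here.

```python
# Generic fill-mask sketch: predict a missing token from its context.
# bert-base-multilingual-cased is a stand-in, not the paper's model.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

damaged = f"The scribe wrote the tablet in the {fill.tokenizer.mask_token} of the king."
for candidate in fill(damaged, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```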
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model brings an average improvement of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
- AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding [0.0]
We develop an Arabic language representation model, which we name AraELECTRA.
Our model is pretrained using the replaced token detection objective on large Arabic text corpora.
We show that AraELECTRA outperforms current state-of-the-art Arabic language representation models, given the same pretraining data and with even a smaller model size.
arXiv Detail & Related papers (2020-12-31T09:35:39Z)
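The replaced-token-detection objective mentioned in the entry above can be illustrated with the ELECTRA discriminator head in Hugging Face Transformers; the English google/electra-small-discriminator checkpoint is used here as a stand-in for the AraELECTRA models.

```python
# Replaced-token-detection sketch: the discriminator scores every token as
# original (logit < 0) or replaced (logit > 0). Stand-in English checkpoint.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = AutoTokenizer.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# "eats" corrupts the original sentence; the discriminator should flag it.
corrupted = "the quick brown fox eats over the lazy dog"
inputs = tokenizer(corrupted, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per token
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, logits[0]):
    print(f"{token:>8}  replaced={bool(score > 0)}")
```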
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
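A toy sketch of the idea in the entry above: if each candidate word's output embedding is composed from its characters, logits can be computed over an open vocabulary whose size does not depend on the training vocabulary. This is a simplified stand-in, not the paper's architecture.

```python
# Toy compositional output layer: word output embeddings are composed from
# character embeddings, so the layer's size is independent of any fixed
# training vocabulary. Simplified illustration, not the paper's model.
import torch
import torch.nn as nn

class CharComposedOutput(nn.Module):
    def __init__(self, n_chars: int = 128, dim: int = 64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)

    def word_embedding(self, word: str) -> torch.Tensor:
        # Compose one vector for the word from its character sequence.
        char_ids = torch.tensor([[min(ord(c), 127) for c in word]])
        _, hidden = self.encoder(self.char_emb(char_ids))
        return hidden[-1, 0]

    def forward(self, state: torch.Tensor, candidates: list) -> torch.Tensor:
        # Score a decoder state against word embeddings built on the fly,
        # including for words never seen during training.
        table = torch.stack([self.word_embedding(w) for w in candidates])
        return state @ table.T  # unnormalized logits over the candidates

layer = CharComposedOutput()
logits = layer(torch.randn(64), ["cat", "dog", "anunseenword"])
print(logits.shape)  # torch.Size([3])
```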
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models on downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it on downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)