Supporting Undotted Arabic with Pre-trained Language Models
- URL: http://arxiv.org/abs/2111.09791v1
- Date: Thu, 18 Nov 2021 16:47:56 GMT
- Title: Supporting Undotted Arabic with Pre-trained Language Models
- Authors: Aviad Rom and Kfir Bar
- Abstract summary: We study the effect of applying pre-trained Arabic language models on "undotted" Arabic texts.
We suggest several ways of supporting undotted texts with pre-trained models, without additional training, and measure their performance on two Arabic natural-language-processing tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We observe a recent behaviour on social media, in which users intentionally
remove consonantal dots from Arabic letters, in order to bypass
content-classification algorithms. Content classification is typically done by
fine-tuning pre-trained language models, which have been recently employed by
many natural-language-processing applications. In this work we study the effect
of applying pre-trained Arabic language models on "undotted" Arabic texts. We
suggest several ways of supporting undotted texts with pre-trained models,
without additional training, and measure their performance on two Arabic
natural-language-processing downstream tasks. The results are encouraging; in
one of the tasks our method shows nearly perfect performance.
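For illustration, below is a minimal sketch (not taken from the paper) of the "undotting" phenomenon: each dotted Arabic letter is replaced by its dotless skeleton (rasm) form. The specific character mapping and the helper name `undot` are assumptions chosen for this example, not the authors' exact method.

```python
# Minimal sketch of "undotting" Arabic text: dotted letters are mapped to
# dotless skeleton (rasm) counterparts. Illustrative mapping only; the paper
# may use a different character set or mapping strategy.

# Dotted letter -> plausible dotless counterpart (assumption)
UNDOT_MAP = {
    "\u0628": "\u066E",  # beh  -> dotless beh
    "\u062A": "\u066E",  # teh  -> dotless beh
    "\u062B": "\u066E",  # theh -> dotless beh
    "\u062C": "\u062D",  # jeem -> hah
    "\u062E": "\u062D",  # khah -> hah
    "\u0630": "\u062F",  # thal -> dal
    "\u0632": "\u0631",  # zain -> reh
    "\u0634": "\u0633",  # sheen -> seen
    "\u0636": "\u0635",  # dad  -> sad
    "\u0638": "\u0637",  # zah  -> tah
    "\u063A": "\u0639",  # ghain -> ain
    "\u0641": "\u06A1",  # feh  -> dotless feh
    "\u0642": "\u066F",  # qaf  -> dotless qaf
    "\u0646": "\u06BA",  # noon -> noon ghunna (dotless)
    "\u064A": "\u0649",  # yeh  -> alef maksura (dotless)
    "\u0629": "\u0647",  # teh marbuta -> heh
}

def undot(text: str) -> str:
    """Strip consonantal dots by mapping each dotted letter to a skeleton form."""
    return "".join(UNDOT_MAP.get(ch, ch) for ch in text)

if __name__ == "__main__":
    dotted = "تجربة"      # an ordinary dotted Arabic word
    print(undot(dotted))   # prints the same word with dotted letters replaced
```

A mapping like this lets one either simulate undotted inputs for evaluation or normalize them before passing text to a pre-trained model.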
Related papers
- Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods such as typos and word-order shuffling, which resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study of the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data is by far the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z)
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation [82.5217996570387]
We adapt a pre-trained language model for auto-regressive text-to-image generation.
We find that pre-trained language models offer limited help.
arXiv Detail & Related papers (2023-11-27T07:19:26Z)
- Fine-Tashkeel: Finetuning Byte-Level Models for Accurate Arabic Text Diacritization [10.342180619706724]
We finetune token-free pre-trained multilingual models to learn to predict and insert missing diacritics in Arabic text.
We show that we can achieve state-of-the-art results on the diacritization task with a minimal amount of training and no feature engineering.
arXiv Detail & Related papers (2023-03-25T23:41:33Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding [0.0]
We develop an Arabic language representation model, which we name AraELECTRA.
Our model is pretrained using the replaced token detection objective on large Arabic text corpora.
We show that AraELECTRA outperforms current state-of-the-art Arabic language representation models, given the same pretraining data and even with a smaller model size.
arXiv Detail & Related papers (2020-12-31T09:35:39Z)
- Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi [2.76240219662896]
We study the ability of multilingual language models to process an unseen dialect.
We take user-generated North African Arabic as our case study.
We show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect.
arXiv Detail & Related papers (2020-05-01T11:29:23Z)
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
- BERT Fine-tuning For Arabic Text Summarization [0.0]
Our model works with multilingual BERT (as Arabic does not have a pretrained BERT of its own).
We show its performance on an English corpus first before applying it to Arabic corpora in both extractive and abstractive tasks.
arXiv Detail & Related papers (2020-03-29T20:23:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.