I3rab: A New Arabic Dependency Treebank Based on Arabic Grammatical
Theory
- URL: http://arxiv.org/abs/2007.05772v1
- Date: Sat, 11 Jul 2020 13:34:44 GMT
- Title: I3rab: A New Arabic Dependency Treebank Based on Arabic Grammatical
Theory
- Authors: Dana Halabi, Ebaa Fayyoumi, Arafat Awajan
- Abstract summary: This paper is to construct a new Arabic dependency treebank based on the traditional Arabic grammatical theory and the characteristics of the Arabic language.
The proposed Arabic dependency treebank, called I3rab, contrasts with existing Arabic dependency treebanks in two main concepts.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Treebanks are valuable linguistic resources that include the syntactic
structure of a language sentence in addition to POS-tags and morphological
features. They are mainly utilized in modeling statistical parsers. Although
the statistical natural language parser has recently become more accurate for
languages such as English, those for the Arabic language still have low
accuracy. The purpose of this paper is to construct a new Arabic dependency
treebank based on the traditional Arabic grammatical theory and the
characteristics of the Arabic language, to investigate their effects on the
accuracy of statistical parsers. The proposed Arabic dependency treebank,
called I3rab, contrasts with existing Arabic dependency treebanks in two main
concepts. The first concept is the approach of determining the main word of the
sentence, and the second concept is the representation of the joined and covert
pronouns. To evaluate I3rab, we compared its performance against a subset of
Prague Arabic Dependency Treebank that shares a comparable level of details.
The conducted experiments show that the percentage improvement reached up to
7.5% in UAS and 18.8% in LAS.
Related papers
- Enhancing Semantic Similarity Understanding in Arabic NLP with Nested Embedding Learning [0.6752538702870792]
This work presents a novel framework for training Arabic nested embedding models through Matryoshka Embedding Learning.
Our innovative contribution includes the translation of various sentence similarity datasets into Arabic.
We trained several embedding models on the Arabic Natural Language Inference triplet dataset and assessed their performance.
arXiv Detail & Related papers (2024-07-30T19:03:03Z) - Bilingual Adaptation of Monolingual Foundation Models [48.859227944759986]
We present an efficient method for adapting a monolingual Large Language Model (LLM) to another language.
Our two-stage approach begins with expanding the vocabulary and training only the embeddings matrix.
By continually pre-training on a mix of Arabic and English corpora, the model retains its proficiency in English while acquiring capabilities in Arabic.
arXiv Detail & Related papers (2024-07-13T21:09:38Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present datasetname, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependencys, including the widely used Stanford Core as well as 4 newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z) - ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi)
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z) - Interpreting Arabic Transformer Models [18.98681439078424]
We probe how linguistic information is encoded in Arabic pretrained models, trained on different varieties of Arabic language.
We perform a layer and neuron analysis on the models using three intrinsic tasks: two morphological tagging tasks based on MSA (modern standard Arabic) and dialectal POS-tagging and a dialectal identification task.
arXiv Detail & Related papers (2022-01-19T06:32:25Z) - Sentiment Analysis in Poems in Misurata Sub-dialect -- A Sentiment
Detection in an Arabic Sub-dialect [0.0]
This study focuses on detecting sentiment in poems written in Misurata Arabic sub-dialect spoken in Libya.
The tools used to detect sentiment from the dataset are Sklearn as well as Mazajak sentiment tool 1.
arXiv Detail & Related papers (2021-09-15T10:42:39Z) - Effect of Word Embedding Variable Parameters on Arabic Sentiment
Analysis Performance [0.0]
Social media such as Twitter, Facebook, etc. has led to a generated growing number of comments that contains users opinions.
This study will discuss three parameters (Window size, Dimension of vector and Negative Sample) for Arabic sentiment analysis.
Four binary classifiers (Logistic Regression, Decision Tree, Support Vector Machine and Naive Bayes) are used to detect sentiment.
arXiv Detail & Related papers (2021-01-08T08:31:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.