Arabic Fine-Grained Entity Recognition
- URL: http://arxiv.org/abs/2310.17333v2
- Date: Mon, 18 Dec 2023 18:55:49 GMT
- Title: Arabic Fine-Grained Entity Recognition
- Authors: Haneen Liqreina, Mustafa Jarrar, Mohammed Khalilia, Ahmed Oumar
El-Shangiti, Muhammad Abdul-Mageed
- Abstract summary: This article aims to advance Arabic NER with fine-grained entities.
Four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC) are extended with 31 subtypes.
To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines.
All mentions of GPE, LOC, ORG, and FAC in Wojood are manually annotated with the LDC's ACE sub-types.
- Score: 14.230912397408765
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditional NER systems are typically trained to recognize coarse-grained
entities, and less attention is given to classifying entities into a hierarchy
of fine-grained lower-level subtypes. This article aims to advance Arabic NER
with fine-grained entities. We chose to extend Wojood (an open-source Nested
Arabic Named Entity Corpus) with subtypes. In particular, four main entity
types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG),
and facility (FAC), are extended with 31 subtypes. To do this, we first revised
Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's
ACE guidelines, which yielded 5,614 changes. Second, all mentions of GPE, LOC,
ORG, and FAC (~44K) in Wojood are manually annotated with the LDC's ACE
sub-types. We refer to this extended version of Wojood as WojoodFine. To
evaluate our annotations, we measured the inter-annotator agreement (IAA) using
both Cohen's Kappa and F1 score, resulting in 0.9861 and 0.9889, respectively.
To compute the baselines of WojoodFine, we fine-tune three pre-trained Arabic
BERT encoders in three settings: flat NER, nested NER, and nested NER with
subtypes, achieving F1 scores of 0.920, 0.866, and 0.885, respectively. Our
corpus and models are open-source and available at
https://sina.birzeit.edu/wojood/.
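The abstract reports inter-annotator agreement via both Cohen's Kappa and F1. As a minimal illustration of the Kappa side of that measurement (the label sequences below are invented, not drawn from Wojood), Kappa over two annotators' subtype labels can be computed as:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

Kappa corrects the raw agreement rate for agreement expected by chance, which matters when a few subtypes dominate the label distribution.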
Related papers
- mucAI at WojoodNER 2024: Arabic Named Entity Recognition with Nearest Neighbor Search [0.0]
We introduce Arabic KNN-NER, our submission to the Wojood NER Shared Task 2024 (ArabicNLP 2024).
In this paper, we tackle fine-grained flat-entity recognition for Arabic text, where we identify a single main entity and possibly zero or multiple sub-entities for each word.
Our submission achieved 91% on the test set on the WojoodFine dataset, placing Arabic KNN-NER on top of the leaderboard for the shared task.
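kNN-augmented taggers of this kind typically interpolate the model's per-token label distribution with a distribution induced from retrieved nearest neighbors. A minimal sketch of that interpolation (the distances, labels, temperature, and λ below are purely illustrative, not the authors' exact method):

```python
import math
from collections import defaultdict

def knn_distribution(neighbors, labels, temperature=1.0):
    """Turn retrieved (distance, label) neighbors into a label distribution."""
    weights = defaultdict(float)
    for dist, label in neighbors:
        # Closer neighbors contribute more mass to their label.
        weights[label] += math.exp(-dist / temperature)
    z = sum(weights.values())
    return {lab: weights.get(lab, 0.0) / z for lab in labels}

def interpolate(p_model, p_knn, lam=0.5):
    """Final per-token distribution: lam * kNN + (1 - lam) * model."""
    return {lab: lam * p_knn[lab] + (1 - lam) * p_model[lab] for lab in p_model}
```

Retrieval can correct the tagger on rare fine-grained subtypes that are underrepresented in its training signal.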
arXiv Detail & Related papers (2024-08-07T09:34:55Z)
- WojoodNER 2024: The Second Arabic Named Entity Recognition Shared Task [13.55190646427114]
WojoodNER-2024 encompassed three subtasks: (i) Closed-Track Flat Fine-Grained NER, (ii) Closed-Track Nested Fine-Grained NER, and (iii) an Open-Track NER for the Israeli War on Gaza.
The winning teams achieved F-1 scores of 91% and 92% in the Flat Fine-Grained and Nested Fine-Grained Subtasks, respectively.
arXiv Detail & Related papers (2024-07-13T16:17:08Z) - BOOST: Harnessing Black-Box Control to Boost Commonsense in LMs'
Generation [60.77990074569754]
We present a computation-efficient framework that steers a frozen Pre-Trained Language Model towards more commonsensical generation.
Specifically, we first construct a reference-free evaluator that assigns a sentence with a commonsensical score.
We then use the scorer as the oracle for commonsense knowledge, and extend the controllable generation method called NADO to train an auxiliary head.
arXiv Detail & Related papers (2023-10-25T23:32:12Z) - NERetrieve: Dataset for Next Generation Named Entity Recognition and
Retrieval [49.827932299460514]
We argue that capabilities provided by large language models are not the end of NER research, but rather an exciting beginning.
We present three variants of the NER task, together with a dataset to support them.
We provide a large, silver-annotated corpus of 4 million paragraphs covering 500 entity types.
arXiv Detail & Related papers (2023-10-22T12:23:00Z) - ANER: Arabic and Arabizi Named Entity Recognition using
Transformer-Based Approach [0.0]
We present ANER, a web-based named entity recognizer for the Arabic, and Arabizi languages.
The model is built upon BERT, which is a transformer-based encoder.
It can recognize 50 different entity classes, covering various fields.
arXiv Detail & Related papers (2023-08-28T15:54:48Z) - Recall, Expand and Multi-Candidate Cross-Encode: Fast and Accurate
Ultra-Fine Entity Typing [46.85183839946139]
State-of-the-art (SOTA) methods use the cross-encoder (CE) based architecture.
We use a novel model called MCCE to concurrently encode and score these K candidates.
We also found MCCE is very effective in fine-grained (130 types) and coarse-grained (9 types) entity typing.
arXiv Detail & Related papers (2022-12-18T16:42:52Z) - Optimizing Bi-Encoder for Named Entity Recognition via Contrastive
Learning [80.36076044023581]
We present an efficient bi-encoder framework for named entity recognition (NER).
We frame NER as a metric learning problem that maximizes the similarity between the vector representations of an entity mention and its type.
A major challenge to this bi-encoder formulation for NER lies in separating non-entity spans from entity mentions.
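In such a metric-learning formulation, each mention embedding is scored against the type embeddings, and low-similarity spans fall back to non-entity. A hypothetical sketch of the inference step (the vectors, type names, and threshold below are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def assign_type(mention_vec, type_vecs, threshold=0.5):
    """Pick the entity type whose embedding is most similar to the mention.

    Spans whose best similarity falls below the threshold are treated as
    non-entity ("O") -- one simple way to separate non-entity spans from
    entity mentions.
    """
    best, score = max(((t, cosine(mention_vec, v)) for t, v in type_vecs.items()),
                      key=lambda item: item[1])
    return best if score >= threshold else "O"
```

The thresholded fallback is exactly where the abstract's "separating non-entity spans" challenge bites: the decision boundary between "O" and the nearest type must be learned, not hand-set as here.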
arXiv Detail & Related papers (2022-08-30T23:19:04Z)
- Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT [1.2891210250935146]
Wojood consists of 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types.
The data contains about 75K entities and 22.5% of which are nested.
Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
arXiv Detail & Related papers (2022-05-19T16:06:49Z)
- Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
- MobIE: A German Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
- Automatic Difficulty Classification of Arabic Sentences [0.0]
Our 3-way CEFR classification achieves F1 scores of 0.80 (Arabic-BERT) and 0.75 (XLM-R), and regression achieves a Spearman correlation of 0.71.
We compare the use of sentence embeddings of different kinds (fastText, mBERT, XLM-R and Arabic-BERT) as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners.
arXiv Detail & Related papers (2021-03-07T16:02:04Z)