Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT
- URL: http://arxiv.org/abs/2205.09651v2
- Date: Mon, 23 May 2022 07:33:05 GMT
- Title: Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT
- Authors: Mustafa Jarrar, Mohammed Khalilia, Sana Ghanem
- Abstract summary: Wojood consists of 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types.
The data contains about 75K entities, 22.5% of which are nested.
Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents Wojood, a corpus for Arabic nested Named Entity
Recognition (NER). Nested entities occur when one entity mention is embedded
inside another entity mention. Wojood consists of about 550K Modern Standard
Arabic (MSA) and dialect tokens that are manually annotated with 21 entity
types including person, organization, location, event and date. More
importantly, the corpus is annotated with nested entities instead of the more
common flat annotations. The data contains about 75K entities, 22.5% of
which are nested. The inter-annotator evaluation of the corpus demonstrated
strong agreement with Cohen's Kappa of 0.979 and an F1-score of 0.976. To
validate our data, we used the corpus to train a nested NER model based on
multi-task learning and AraBERT (Arabic BERT). The model achieved an overall
micro F1-score of 0.884. Our corpus, the annotation guidelines, the source code
and the pre-trained model are publicly available.
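To make the terms "nested" and "micro F1-score" from the abstract concrete, here is a minimal Python sketch. It is not the authors' code or evaluation script: the span representation, the function names `nested_spans` and `micro_f1`, and the English example sentence are all assumptions made for illustration, and the paper's own scoring may differ in detail.

```python
from typing import List, Set, Tuple

# Hypothetical representation: (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def nested_spans(spans: List[Span]) -> List[Span]:
    """Return the spans that lie strictly inside another span."""
    result = []
    for i, (start, end, label) in enumerate(spans):
        for j, (o_start, o_end, _) in enumerate(spans):
            contained = o_start <= start and end <= o_end
            if i != j and contained and (o_start, o_end) != (start, end):
                result.append((start, end, label))
                break
    return result


def micro_f1(gold: List[Set[Span]], pred: List[Set[Span]]) -> float:
    """Micro-averaged F1: pool exact-match counts over all sentences."""
    true_pos = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


# Invented example: "Bank of Palestine opened in Ramallah"
# tokens:             0    1  2         3      4  5
gold_sentence: List[Span] = [
    (0, 3, "ORG"),  # "Bank of Palestine"
    (2, 3, "GPE"),  # "Palestine", nested inside the ORG mention
    (5, 6, "GPE"),  # "Ramallah"
]
# A made-up prediction that misses the nested GPE and adds a wrong LOC.
predicted: Set[Span] = {(0, 3, "ORG"), (5, 6, "GPE"), (3, 4, "LOC")}

print(nested_spans(gold_sentence))                 # [(2, 3, 'GPE')]
print(micro_f1([set(gold_sentence)], [predicted]))  # 2 of 3 matched on each side
```

A flat annotation scheme could keep only one of the two overlapping mentions in the example, which is the limitation nested annotation avoids.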
Related papers
- Entity Disambiguation via Fusion Entity Decoding [68.77265315142296]
We propose an encoder-decoder model to disambiguate entities with more detailed entity descriptions.
We observe +1.5% improvements in end-to-end entity linking in the GERBIL benchmark compared with EntQA.
arXiv Detail & Related papers (2024-04-02T04:27:54Z)
- Arabic Fine-Grained Entity Recognition [14.230912397408765]
This article aims to advance Arabic NER with fine-grained entities.
Four main entity types in Wojood, geopolitical entity (GPE), location (LOC), organization (ORG), and facility (FAC), are extended with 31 subtypes.
To do this, we first revised Wojood's annotations of GPE, LOC, ORG, and FAC to be compatible with the LDC's ACE guidelines.
All mentions of GPE, LOC, ORG, and FAC in Wojood are manually annotated with the LDC's ACE sub-types.
arXiv Detail & Related papers (2023-10-26T11:59:45Z)
- ANER: Arabic and Arabizi Named Entity Recognition using Transformer-Based Approach [0.0]
We present ANER, a web-based named entity recognizer for the Arabic and Arabizi languages.
The model is built upon BERT, which is a transformer-based encoder.
It can recognize 50 different entity classes, covering various fields.
arXiv Detail & Related papers (2023-08-28T15:54:48Z)
- People and Places of Historical Europe: Bootstrapping Annotation Pipeline and a New Corpus of Named Entities in Late Medieval Texts [0.0]
We develop a new NER corpus of 3.6M sentences from late medieval charters written mainly in Czech, Latin, and German.
We show that we can start with a list of known historical figures and locations and an unannotated corpus of historical texts, and use information retrieval techniques to automatically bootstrap a NER-annotated corpus.
arXiv Detail & Related papers (2023-05-26T08:05:01Z)
- Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z)
- Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values [73.82043713141142]
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values.
We introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command.
arXiv Detail & Related papers (2022-10-14T09:10:49Z)
- Label Semantics for Few Shot Named Entity Recognition [68.01364012546402]
We study the problem of few shot learning for named entity recognition.
We leverage the semantic information in the names of the labels as a way of giving the model additional signal and enriched priors.
Our model learns to match the representations of named entities computed by the first encoder with label representations computed by the second encoder.
arXiv Detail & Related papers (2022-03-16T23:21:05Z)
- KazNERD: Kazakh Named Entity Recognition Dataset [5.094176584161206]
We present the development of a dataset for Kazakh named entity recognition.
The dataset was built as there is a clear need for publicly available annotated corpora in Kazakh.
The resulting dataset contains 112,702 sentences and 136,333 annotations for 25 entity classes.
arXiv Detail & Related papers (2021-11-26T10:56:19Z)
- MobIE: A German Dataset for Named Entity Recognition, Entity Linking and Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
- Autoregressive Entity Retrieval [55.38027440347138]
Entities are at the center of how we represent and aggregate knowledge.
The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering.
We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion.
arXiv Detail & Related papers (2020-10-02T10:13:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.