ANER: Arabic and Arabizi Named Entity Recognition using
Transformer-Based Approach
- URL: http://arxiv.org/abs/2308.14669v1
- Date: Mon, 28 Aug 2023 15:54:48 GMT
- Title: ANER: Arabic and Arabizi Named Entity Recognition using
Transformer-Based Approach
- Authors: Abdelrahman "Boda" Sadallah, Omar Ahmed, Shimaa Mohamed, Omar Hatem,
Doaa Hesham, Ahmed H. Yousef
- Abstract summary: We present ANER, a web-based named entity recognizer for the Arabic and Arabizi languages.
The model is built upon BERT, which is a transformer-based encoder.
It can recognize 50 different entity classes, covering various fields.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the main tasks of Natural Language Processing (NLP) is Named Entity
Recognition (NER). It is used in many applications and can also serve as an
intermediate step for other tasks. We present ANER, a web-based named entity
recognizer for the Arabic and Arabizi languages. The model is built upon BERT,
which is a transformer-based encoder. It can recognize 50 different entity
classes, covering various fields. We trained our model on the WikiFANE_Gold
dataset, which consists of Wikipedia articles. We achieved an F1 score of
88.7%, which beats CAMeL Tools' F1 score of 83% on the ANERcorp dataset,
which has only 4 classes. We also obtained an F1 score of 77.7% on the
NewsFANE_Gold dataset, which contains out-of-domain data from news articles.
The system is deployed on a user-friendly web interface that accepts users'
inputs in Arabic or Arabizi. It allows users to explore the entities in the
text by highlighting them. It can also direct users to Wikipedia for more
information about an entity. The website can run NER with either our model or
CAMeL Tools' model. ANER is publicly accessible
at http://www.aner.online. We also deployed our model on HuggingFace at
https://huggingface.co/boda/ANER to allow developers to test and use it.
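The F1 scores quoted in the abstract can be illustrated with a minimal sketch of how an NER F1 score is commonly computed. The abstract does not say which scoring variant the authors used, so this assumes entity-level, micro-averaged, exact-match scoring over BIO tags. The function names `bio_to_spans` and `entity_f1` and the tag labels are hypothetical, chosen only for illustration.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); an illustrative type alias
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence such as ["B-PER", "I-PER", "O"]."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel "O" flushes the last span
        # An I- tag continues the open span only if its type matches.
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        # A B- tag opens a span; a stray I- tag is leniently treated as an opener too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exactly matching entity spans."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
        tp += len(g & p)   # spans with identical boundaries and type
        fp += len(p - g)   # predicted spans absent from the gold annotation
        fn += len(g - p)   # gold spans the prediction missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [["B-PER", "I-PER", "O", "B-LOC"]]
    pred = [["B-PER", "I-PER", "O", "O"]]
    print(entity_f1(gold, pred))
```

In the toy example the prediction finds one of the two gold entities and makes no false predictions, so precision is 1.0, recall is 0.5, and F1 is about 0.667. Scores from a token-level or macro-averaged variant would differ, which is one reason F1 figures on different datasets and tag sets are not directly comparable.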
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Multicultural Name Recognition For Previously Unseen Names [65.268245109828]
This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name.
I look at names from 103 countries to compare how well the model performs on names from different cultures.
I find that a model with combined character and word input outperforms word-only models and may improve on accuracy compared to classical NER models.
arXiv Detail & Related papers (2024-01-23T17:58:38Z)
- Using LSTM and GRU With a New Dataset for Named Entity Recognition in
the Arabic Language [0.0]
We use the BIOES format to tag each word, which allows us to handle nested named entities.
This work proposes long short term memory (LSTM) units and Gated Recurrent Units (GRU) for building the named entity recognition model in the Arabic language.
arXiv Detail & Related papers (2023-04-06T22:14:02Z)
- AsNER -- Annotated Dataset and Baseline for Assamese Named Entity
recognition [7.252817150901275]
The proposed NER dataset is likely to be a significant resource for deep neural based Assamese language processing.
We benchmark the dataset by training NER models and evaluating using state-of-the-art architectures for supervised named entity recognition.
The highest F1-score among all baselines is 80.69%, obtained when using MuRIL as a word embedding method.
arXiv Detail & Related papers (2022-07-07T16:45:55Z)
- Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT [1.2891210250935146]
Wojood consists of 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types.
The data contains about 75K entities, 22.5% of which are nested.
Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
arXiv Detail & Related papers (2022-05-19T16:06:49Z)
- HiNER: A Large Hindi Named Entity Recognition Dataset [29.300418937509317]
This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper.
arXiv Detail & Related papers (2022-04-28T19:14:21Z)
- NEREL: A Russian Dataset with Nested Named Entities and Relations [55.69103749079697]
We present NEREL, a Russian dataset for named entity recognition and relation extraction.
It contains 56K annotated named entities and 39K annotated relations.
arXiv Detail & Related papers (2021-08-30T10:40:20Z)
- MobIE: A German Dataset for Named Entity Recognition, Entity Linking and
Relation Extraction in the Mobility Domain [76.21775236904185]
The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities.
A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types.
To the best of our knowledge, this is the first German-language dataset that combines annotations for NER, EL and RE.
arXiv Detail & Related papers (2021-08-16T08:21:50Z)
- Autoregressive Entity Retrieval [55.38027440347138]
Entities are at the center of how we represent and aggregate knowledge.
The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering.
We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion.
arXiv Detail & Related papers (2020-10-02T10:13:31Z)
- Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)
- Beheshti-NER: Persian Named Entity Recognition Using BERT [0.0]
In this paper, we use the pre-trained deep bidirectional network, BERT, to make a model for named entity recognition in Persian.
Our results are CoNLL F1 scores of 83.5 and 88.4 in phrase-level and word-level evaluation, respectively.
arXiv Detail & Related papers (2020-03-19T15:55:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.