Using LSTM and GRU With a New Dataset for Named Entity Recognition in
the Arabic Language
- URL: http://arxiv.org/abs/2304.03399v1
- Date: Thu, 6 Apr 2023 22:14:02 GMT
- Title: Using LSTM and GRU With a New Dataset for Named Entity Recognition in
the Arabic Language
- Authors: Alaa Shaker, Alaa Aldarf and Igor Bessmertny
- Abstract summary: We use the BIOES format to tag each word, which allows us to handle nested named entities.
This work proposes long short-term memory (LSTM) units and Gated Recurrent Units (GRU) for building a named entity recognition model for the Arabic language.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Named entity recognition (NER) is a natural language processing (NLP)
task that aims to identify named entities and classify them as person, location,
organization, etc. The Arabic language contains a considerable amount of
unstructured data, and it requires different preprocessing tools than languages
such as English, Russian, or German. This underlines the importance of building
a new structured dataset to address the lack of structured Arabic data. In this
work, we use the BIOES format to tag each word, which allows us to handle nested
named entities that consist of more than one word and to mark the start and the
end of each name. The dataset consists of more than thirty-six thousand records.
In addition, this work proposes long short-term memory (LSTM) units and Gated
Recurrent Units (GRU) for building a named entity recognition model for the
Arabic language. The models give a reasonably good result (approximately 80%)
because LSTM and GRU models can capture the relationships between the words of
a sentence. We also use Trax, a new library from Google, and the Colab platform.
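As an illustration of the BIOES scheme the abstract describes, the sketch below converts entity spans to per-word tags: B, I, and E mark the beginning, inside, and end of a multi-word name, S marks a single-word name, and O marks everything else. The sentence, spans, and helper function are illustrative assumptions, not taken from the paper's dataset or code.

```python
def spans_to_bioes(n_tokens, spans):
    """Convert (start, end, label) entity spans (end exclusive) to BIOES tags.

    Hypothetical helper for illustration; not from the paper's codebase.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = f"S-{label}"          # single-token entity
        else:
            tags[start] = f"B-{label}"          # beginning of the name
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # inside of the name
            tags[end - 1] = f"E-{label}"        # end of the name
    return tags

# Illustrative sentence: a two-word person name and a one-word location.
tokens = ["Alaa", "Shaker", "visited", "Damascus"]
print(spans_to_bioes(len(tokens), [(0, 2, "PER"), (3, 4, "LOC")]))
# -> ['B-PER', 'E-PER', 'O', 'S-LOC']
```

Because the scheme marks both the start (B) and the end (E) of every name explicitly, a tagger's output can be decoded back into spans without ambiguity about where one name stops and the next begins.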
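Since the abstract names Trax and Colab but shows no code, here is a minimal sketch of the kind of LSTM/GRU word-level tagger one can build with Trax's layers API. The vocabulary size, embedding width, and tag set are placeholder assumptions, and this is not the authors' released implementation.

```python
from trax import layers as tl

VOCAB_SIZE = 36000   # assumption: word-vocabulary size, not from the paper
D_MODEL = 50         # assumption: embedding and recurrent hidden width
TAGS = ["O", "B-PER", "I-PER", "E-PER", "S-PER",
        "B-LOC", "I-LOC", "E-LOC", "S-LOC"]   # illustrative BIOES subset

def ner_model(rnn_layer=tl.LSTM):
    """Embedding -> recurrent encoder -> per-token log-probs over tags."""
    return tl.Serial(
        tl.Embedding(vocab_size=VOCAB_SIZE, d_feature=D_MODEL),
        rnn_layer(n_units=D_MODEL),   # tl.LSTM or tl.GRU
        tl.Dense(len(TAGS)),          # one score per BIOES tag, per token
        tl.LogSoftmax(),              # normalize scores to log-probabilities
    )

lstm_model = ner_model(tl.LSTM)
gru_model = ner_model(tl.GRU)   # the GRU variant reuses the same skeleton
```

Swapping `tl.LSTM` for `tl.GRU` is the only change needed to compare the two recurrent units, which matches the paper's framing of LSTM and GRU as alternative encoders for the same tagging task.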
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Multicultural Name Recognition For Previously Unseen Names [65.268245109828]
This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name.
I look at names from 103 countries to compare how well the model performs on names from different cultures.
I find that a model with combined character and word input outperforms word-only models and may improve accuracy compared to classical NER models.
arXiv Detail & Related papers (2024-01-23T17:58:38Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - CompoundPiece: Evaluating and Improving Decompounding Performance of
Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z) - Building Multilingual Corpora for a Complex Named Entity Recognition and
Classification Hierarchy using Wikipedia and DBpedia [0.0]
We present the UNER dataset, a multilingual and hierarchical parallel corpus annotated for named-entities.
We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia, using DBpedia information.
arXiv Detail & Related papers (2022-12-14T11:38:48Z) - AsNER -- Annotated Dataset and Baseline for Assamese Named Entity
Recognition [7.252817150901275]
The proposed NER dataset is likely to be a significant resource for deep neural based Assamese language processing.
We benchmark the dataset by training NER models and evaluating using state-of-the-art architectures for supervised named entity recognition.
The highest F1-score among all baselines achieves an accuracy of 80.69% when using MuRIL as a word embedding method.
arXiv Detail & Related papers (2022-07-07T16:45:55Z) - Part-of-Speech Tagging of Odia Language Using Statistical and Deep
Learning-Based Approaches [0.0]
This research work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) to develop an Odia part-of-speech tagger.
It has been observed that the Bi-LSTM model with character sequence features and pre-trained word vectors achieved significant, state-of-the-art results.
arXiv Detail & Related papers (2022-07-07T12:15:23Z) - CREER: A Large-Scale Corpus for Relation Extraction and Entity
Recognition [9.54366784050374]
The CREER dataset uses the Stanford CoreNLP Annotator to capture rich language structures from Wikipedia plain text.
This dataset follows widely used linguistic and semantic annotations so that it can be used not only for most natural language processing tasks but also for scaling the dataset.
arXiv Detail & Related papers (2022-04-27T05:43:21Z) - Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z) - Structured Prediction as Translation between Augmented Natural Languages [109.50236248762877]
We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks.
Instead of tackling the problem by training task-specific discriminative classifiers, we frame it as a translation task between augmented natural languages.
Our approach can match or outperform task-specific models on all tasks, and in particular, achieves new state-of-the-art results on joint entity and relation extraction.
arXiv Detail & Related papers (2021-01-14T18:32:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all information) and is not responsible for any consequences of its use.