AlbNER: A Corpus for Named Entity Recognition in Albanian
- URL: http://arxiv.org/abs/2309.08741v1
- Date: Fri, 15 Sep 2023 20:03:19 GMT
- Title: AlbNER: A Corpus for Named Entity Recognition in Albanian
- Authors: Erion Çano
- Abstract summary: This paper presents AlbNER, a corpus of 900 sentences with labeled named entities, collected from Albanian Wikipedia articles.
Preliminary results with BERT and RoBERTa variants fine-tuned and tested with AlbNER data indicate that model size has a slight impact on NER performance, whereas language transfer has a significant one.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scarcity of resources such as annotated text corpora for under-resourced
languages like Albanian is a serious impediment in computational linguistics
and natural language processing research. This paper presents AlbNER, a corpus
of 900 sentences with labeled named entities, collected from Albanian Wikipedia
articles. Preliminary results with BERT and RoBERTa variants fine-tuned and
tested with AlbNER data indicate that model size has a slight impact on NER
performance, whereas language transfer has a significant one. The AlbNER corpus
and the results obtained should serve as baselines for future experiments.
Related papers
- FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis [0.0]
The Algerian dialect (AD) faces challenges due to the absence of annotated corpora.
This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD called FASSILA.
arXiv Detail & Related papers (2024-11-07T10:39:10Z)
- Low-Resource Named Entity Recognition with Cross-Lingual, Character-Level Neural Conditional Random Fields [68.17213992395041]
Low-resource named entity recognition is still an open problem in NLP.
We present a transfer learning scheme in which we train character-level neural CRFs to predict named entities for high-resource and low-resource languages jointly.
arXiv Detail & Related papers (2024-04-14T23:44:49Z)
- AlbNews: A Corpus of Headlines for Topic Modeling in Albanian [0.0]
AlbNews is a collection of 600 topically labeled news headlines and 2600 unlabeled ones in Albanian.
The data can be freely used for conducting topic modeling research.
arXiv Detail & Related papers (2024-02-06T14:24:28Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- AlbMoRe: A Corpus of Movie Reviews for Sentiment Analysis in Albanian [0.0]
AlbMoRe is a corpus of 800 movie reviews in Albanian.
Each text is labeled as positive or negative and can be used for sentiment analysis research.
arXiv Detail & Related papers (2023-06-14T14:21:55Z)
- Extract and Attend: Improving Entity Translation in Neural Machine Translation [141.7840980565706]
We propose an Extract-and-Attend approach to enhance entity translation in NMT.
The proposed method is effective at improving both the translation accuracy of entities and the overall translation quality.
arXiv Detail & Related papers (2023-06-04T03:05:25Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- ArNLI: Arabic Natural Language Inference for Entailment and Contradiction Detection [1.8275108630751844]
We have created a dataset of more than 12k sentences, named ArNLI, that will be publicly available.
We propose an approach to detect contradictions between pairs of sentences in Arabic, using a contradiction vector combined with a language model vector as input to a machine learning model.
The best results were achieved using a Random Forest classifier, with accuracies of 99%, 60%, and 75% on PHEME, SICK, and ArNLI, respectively.
arXiv Detail & Related papers (2022-09-28T09:37:16Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.