AlbNews: A Corpus of Headlines for Topic Modeling in Albanian
- URL: http://arxiv.org/abs/2402.04028v1
- Date: Tue, 6 Feb 2024 14:24:28 GMT
- Title: AlbNews: A Corpus of Headlines for Topic Modeling in Albanian
- Authors: Erion Çano, Dario Lamaj
- Abstract summary: AlbNews is a collection of 600 topically labeled news headlines and 2600 unlabeled ones in Albanian.
The data can be freely used for conducting topic modeling research.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scarcity of available text corpora for low-resource languages like
Albanian is a serious hurdle for research in natural language processing tasks.
This paper introduces AlbNews, a collection of 600 topically labeled news
headlines and 2600 unlabeled ones in Albanian. The data can be freely used for
conducting topic modeling research. We report the initial classification scores
of some traditional machine learning classifiers trained with the AlbNews
samples. These results show that basic models outperform the ensemble learning ones
and can serve as a baseline for future experiments.
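As an illustration of what a traditional baseline classifier for labeled headlines might look like, the following is a minimal sketch of a multinomial Naive Bayes model with add-one smoothing, written with the Python standard library only. It is not the authors' code or experimental setup: the toy English headlines, the two label names, and the function names `tokenize`, `train_naive_bayes`, and `predict` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(samples):
    """Count classes and per-class word frequencies from (headline, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data; these are NOT samples or labels from AlbNews.
TOY_HEADLINES = [
    ("team wins the championship final", "sport"),
    ("striker scores twice in derby match", "sport"),
    ("central bank raises interest rates", "economy"),
    ("inflation slows as prices stabilize", "economy"),
]

model = train_naive_bayes(TOY_HEADLINES)
print(predict(model, "bank cuts interest rates"))
print(predict(model, "team scores in final match"))
```

With real data, the toy list would be replaced by the labeled headlines, and scores would be reported on a held-out split rather than on hand-picked examples.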
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs).
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data by far is the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z) - AlbNER: A Corpus for Named Entity Recognition in Albanian [0.0]
This paper presents AlbNER, a corpus of 900 sentences with labeled named entities, collected from Albanian Wikipedia articles.
Preliminary results with BERT and RoBERTa variants fine-tuned and tested with AlbNER data indicate that model size has a slight impact on NER performance, whereas language transfer has a significant one.
arXiv Detail & Related papers (2023-09-15T20:03:19Z) - Benchmarking Multilabel Topic Classification in the Kyrgyz Language [6.15353988889181]
We present a new public benchmark for topic classification in Kyrgyz based on collected and annotated data from the news site 24.KG.
We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
arXiv Detail & Related papers (2023-08-30T11:02:26Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - AlbMoRe: A Corpus of Movie Reviews for Sentiment Analysis in Albanian [0.0]
AlbMoRe is a corpus of 800 movie reviews in Albanian.
Each text is labeled as positive or negative and can be used for sentiment analysis research.
arXiv Detail & Related papers (2023-06-14T14:21:55Z) - UrduFake@FIRE2020: Shared Track on Fake News Identification in Urdu [62.6928395368204]
This paper gives an overview of the first shared task at FIRE 2020 on fake news detection in the Urdu language.
The goal is to identify fake news using a dataset composed of 900 annotated news articles for training and 400 news articles for testing.
The dataset contains news in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and (v) Business.
arXiv Detail & Related papers (2022-07-25T03:46:51Z) - DziriBERT: a Pre-trained Language Model for the Algerian Dialect [2.064612766965483]
We study the Algerian dialect, which has several specificities that make the use of Arabic or multilingual models inappropriate.
To address this issue, we collected more than one million Algerian tweets and pre-trained the first Algerian language model: DziriBERT.
arXiv Detail & Related papers (2021-09-25T11:51:35Z) - New Arabic Medical Dataset for Diseases Classification [55.41644538483948]
We introduce a new Arabic medical dataset, which includes two thousand medical documents collected from several Arabic medical websites.
The dataset was built for the task of classifying texts and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological).
Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, Arabert, which is based on BERT with a large Arabic corpus, and AraBioNER, which is based on Arabert with an Arabic medical corpus.
arXiv Detail & Related papers (2021-06-29T10:42:53Z) - An Amharic News Text classification Dataset [0.0]
We aim to introduce the Amharic text classification dataset that consists of more than 50k news articles that were categorized into 6 classes.
This dataset is made available with easy baseline performances to encourage studies and better performance experiments.
arXiv Detail & Related papers (2021-03-10T16:36:39Z) - Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z)