KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis
- URL: http://arxiv.org/abs/2509.16804v1
- Date: Sat, 20 Sep 2025 20:44:29 GMT
- Title: KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis
- Authors: Kozhin muhealddin Awlla, Hadi Veisi, Abdulhady Abas Abdullah,
- Abstract summary: This paper enhances the study of sentiment analysis for the Central Kurdish language by integrating the Bidirectional Encoder Representations from Transformers (BERT) into Natural Language Processing techniques.
- Score: 0.979204203262436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper enhances the study of sentiment analysis for the Central Kurdish language by integrating the Bidirectional Encoder Representations from Transformers (BERT) into Natural Language Processing techniques. Kurdish is a low-resourced language with a high level of linguistic diversity and minimal computational resources, which makes sentiment analysis challenging. Earlier work relied on traditional word embedding models such as Word2Vec, but the emergence of newer language models, particularly BERT, offers room for improvement. BERT's stronger contextual word embeddings aid this study in capturing the nuanced semantics and contextual intricacies of Kurdish, setting a new benchmark for sentiment analysis in low-resource languages.
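As a hedged illustration of the approach the abstract describes, the sketch below shows how sentiment classification with a BERT-style encoder is commonly set up: the pooled [CLS] embedding is passed through a linear layer and a softmax over sentiment labels. All dimensions and weights here are toy stand-ins (real BERT uses 768-dimensional hidden states, and KuBERT's actual head is not specified in this summary).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a BERT [CLS] sentence embedding (real models use 768 dims).
hidden_size, num_labels = 8, 3  # 3 hypothetical labels: negative/neutral/positive
cls_embedding = rng.normal(size=hidden_size)

# Classification head: one linear layer followed by softmax, the usual way a
# sentiment head is attached on top of a pretrained BERT encoder.
W = rng.normal(size=(num_labels, hidden_size))
b = np.zeros(num_labels)

logits = W @ cls_embedding + b
probs = np.exp(logits - logits.max())  # numerically stable softmax
probs /= probs.sum()

label = ["negative", "neutral", "positive"][int(probs.argmax())]
print(label, probs.round(3))
```

In a real fine-tuning setup, W and b would be trained jointly with the encoder on labeled Kurdish sentiment data rather than drawn at random.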
Related papers
- KurdSTS: The Kurdish Semantic Textual Similarity [0.979204203262436]
A dataset of 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
arXiv Detail & Related papers (2025-09-26T14:55:55Z)
- Idiom Detection in Sorani Kurdish Texts [1.174020933567308]
This study addresses idiom detection in Sorani Kurdish by approaching it as a text classification task using deep learning techniques. We developed and evaluated three deep learning models: KuBERT-based transformer sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism. The evaluations revealed that the transformer model, the fine-tuned BERT, consistently outperformed the others, achieving nearly 99% accuracy.
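One building block mentioned above, attention pooling over BiLSTM outputs, can be sketched in a few lines: each token vector is scored, the scores are softmaxed into weights, and the sequence is collapsed into a single weighted summary vector for classification. The dimensions and the scoring vector below are toy assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for BiLSTM outputs: one hidden vector per token of a
# 5-token sentence (real BiLSTMs concatenate forward/backward states).
seq_len, hidden = 5, 6
H = rng.normal(size=(seq_len, hidden))

# Dot-product attention: score each token against a learned query vector,
# softmax the scores, then pool the tokens into one context vector.
v = rng.normal(size=hidden)          # hypothetical learned query
scores = H @ v
weights = np.exp(scores - scores.max())
weights /= weights.sum()             # attention weights sum to 1
context = weights @ H                # (hidden,) sentence representation

print(weights.round(3), context.shape)
```

The context vector would then feed a linear classification layer, letting the model emphasize the tokens most indicative of idiomatic usage.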
arXiv Detail & Related papers (2025-01-24T14:31:30Z)
- NER- RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages [3.5403652483328223]
This work proposes a methodology for fine-tuning the pre-trained RoBERTa model for Kurdish NER (KNER). Experiments show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance.
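To illustrate why subword tokenization helps in low-resource settings, here is a toy greedy longest-match tokenizer. This is NOT SentencePiece's actual algorithm (which learns a unigram or BPE vocabulary from data); the vocabulary below is invented purely to show that an unseen word can still decompose into known pieces instead of becoming an out-of-vocabulary token.

```python
# Hypothetical subword vocabulary; a trained model would learn this from a corpus.
VOCAB = {"kurd", "istan", "i", "s", "tan", "k", "u", "r", "d", "a", "n"}

def subword_tokenize(word, vocab=VOCAB):
    """Greedy longest-match segmentation of a word into vocabulary pieces."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest vocabulary piece starting at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return None  # character not covered by any piece: genuine OOV
    return pieces

print(subword_tokenize("kurdistan"))  # ['kurd', 'istan']
```

A word never seen whole during training still maps to familiar subword embeddings, which is one reason subword tokenizers improve NER for morphologically rich languages like Kurdish.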
arXiv Detail & Related papers (2024-12-15T07:07:17Z)
- MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer.
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Hunspell for Sorani Kurdish Spell Checking and Morphological Analysis [0.0]
We present our efforts in annotating a lexicon with morphosyntactic tags and extracting morphological rules of Sorani Kurdish to build a morphological analyzer, a stemmer, and a spell-checking system using Hunspell.
This implementation can be used by researchers for further developments in the field and can also be integrated into text editors under a publicly available license.
arXiv Detail & Related papers (2021-09-14T00:24:20Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Towards Finite-State Morphology of Kurdish [0.76146285961466]
The morphology of the Kurdish language (Sorani dialect) is described from a computational point of view.
We extract morphological rules which are transformed into finite-state transducers for generating and analyzing words.
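A minimal sketch of the idea above: a finite-state transducer reads an analysis string (stem plus morphological tag) and emits a surface form by following state transitions. The single rule below, attaching the Sorani definite-plural suffix "-ekan" to a hypothetical stem, is an invented toy; the paper's actual transducers encode a far richer rule set.

```python
# Toy finite-state transducer for morphological generation: a transition
# table mapping (state, input tag) to (next state, emitted suffix).
TRANSITIONS = {
    ("STEM", "+DEF.PL"): ("END", "ekan"),  # hypothetical single rule
}

def generate(analysis):
    """Generate a surface form from 'stem+TAG', or None if no rule applies."""
    stem, _, tag = analysis.partition("+")
    step = TRANSITIONS.get(("STEM", "+" + tag))
    if step is None:
        return None  # no transition for this tag
    state, suffix = step
    return stem + suffix if state == "END" else None

print(generate("kitêb+DEF.PL"))  # kitêbekan
```

Run in the other direction (surface form to analysis), the same transition table supports analysis and stemming, which is how one rule set can power an analyzer, a stemmer, and generation alike.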
arXiv Detail & Related papers (2020-05-21T13:55:07Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.