KurdSTS: The Kurdish Semantic Textual Similarity
- URL: http://arxiv.org/abs/2510.02336v1
- Date: Fri, 26 Sep 2025 14:55:55 GMT
- Title: KurdSTS: The Kurdish Semantic Textual Similarity
- Authors: Abdulhady Abas Abdullah, Hadi Veisi, Hussein M. Al,
- Abstract summary: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
- Score: 0.979204203262436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
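As a rough illustration of how STS baselines are commonly scored, the sketch below compares cosine similarities between sentence embeddings against human similarity ratings using Pearson correlation. The abstract does not state the evaluation protocol, so this is an assumption rather than the authors' method; the two-dimensional vectors stand in for real sentence embeddings, and the gold ratings and their scale are hypothetical values, not KurdSTS data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy stand-ins for sentence-embedding pairs (hypothetical, not from KurdSTS).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
# Hypothetical human similarity ratings; the rating scale is an assumption.
gold = [5.0, 3.0, 0.0]

predicted = [cosine_similarity(u, v) for u, v in pairs]
score = pearson(predicted, gold)
print(f"Pearson correlation with gold ratings: {score:.3f}")
```

In practice the vectors would come from a sentence encoder, and a rank-based measure such as Spearman correlation is often reported alongside Pearson.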
Related papers
- KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis [0.979204203262436]
This paper advances sentiment analysis for the Central Kurdish language by integrating Bidirectional Encoder Representations from Transformers (BERT) into natural language processing techniques.
arXiv Detail & Related papers (2025-09-20T20:44:29Z)
- L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages [2.584263027095689]
L3Cube-IndicHeadline-ID is a curated dataset spanning ten low-resource Indic languages.
Each language includes 20,000 news articles paired with four headline variants.
The task requires selecting the correct headline from the options using article-headline similarity.
We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity.
arXiv Detail & Related papers (2025-09-02T16:54:30Z)
- Idiom Detection in Sorani Kurdish Texts [1.174020933567308]
This study addresses idiom detection in Sorani Kurdish by approaching it as a text classification task using deep learning techniques.
We developed and evaluated three deep learning models: KuBERT-based transformer sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism.
The evaluations revealed that the transformer model, the fine-tuned BERT, consistently outperformed the others, achieving nearly 99% accuracy.
arXiv Detail & Related papers (2025-01-24T14:31:30Z)
- Non-Contextual BERT or FastText? A Comparative Analysis [0.4194295877935868]
We analyze the effectiveness of non-contextual embeddings from BERT models and FastText models for tasks such as news classification, sentiment analysis, and hate speech detection.
Our findings indicate that non-contextual BERT embeddings extracted from the model's first embedding layer outperform FastText embeddings.
arXiv Detail & Related papers (2024-11-26T18:25:57Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- From Multiple-Choice to Extractive QA: A Case Study for English and Arabic [51.13706104333848]
We explore the feasibility of repurposing an existing multilingual dataset for a new NLP task.
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic.
We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- Low-Resource Named Entity Recognition with Cross-Lingual, Character-Level Neural Conditional Random Fields [68.17213992395041]
Low-resource named entity recognition is still an open problem in NLP.
We present a transfer learning scheme in which we train character-level neural CRFs to predict named entities for high-resource and low-resource languages jointly.
arXiv Detail & Related papers (2024-04-14T23:44:49Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources [38.814057529254846]
We examine the characteristics of 156 publicly available NLP datasets.
We survey language-proficient NLP researchers and crowd workers per language.
We identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform.
arXiv Detail & Related papers (2022-11-28T18:54:33Z)
- Mukayese: Turkish NLP Strikes Back [0.19116784879310023]
We demonstrate that languages such as Turkish lag behind the state of the art in NLP applications.
We present Mukayese, a set of NLP benchmarks for the Turkish language.
We present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking.
arXiv Detail & Related papers (2022-03-02T16:18:44Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.