Idiom Detection in Sorani Kurdish Texts
- URL: http://arxiv.org/abs/2501.14528v2
- Date: Thu, 30 Jan 2025 10:15:35 GMT
- Title: Idiom Detection in Sorani Kurdish Texts
- Authors: Skala Kamaran Omer, Hossein Hassani
- Abstract summary: This study addresses idiom detection in Sorani Kurdish by approaching it as a text classification task using deep learning techniques.
We developed and evaluated three deep learning models: KuBERT-based transformer sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism.
The evaluations revealed that the transformer model, the fine-tuned BERT, consistently outperformed the others, achieving nearly 99% accuracy.
- Score: 1.174020933567308
- License:
- Abstract: Idiom detection using Natural Language Processing (NLP) is the computerized process of recognizing figurative expressions within a text that convey meanings beyond the literal interpretation of the words. While idiom detection has seen significant progress across various languages, the Kurdish language faces a considerable research gap in this area despite the importance of idioms in tasks like machine translation and sentiment analysis. This study addresses idiom detection in Sorani Kurdish by approaching it as a text classification task using deep learning techniques. To tackle this, we developed a dataset containing 10,580 sentences embedding 101 Sorani Kurdish idioms across diverse contexts. Using this dataset, we developed and evaluated three deep learning models: KuBERT-based transformer sequence classification, a Recurrent Convolutional Neural Network (RCNN), and a BiLSTM model with an attention mechanism. The evaluations revealed that the transformer model, the fine-tuned BERT, consistently outperformed the others, achieving nearly 99% accuracy while the RCNN achieved 96.5% and the BiLSTM 80%. These results highlight the effectiveness of Transformer-based architectures in low-resource languages like Kurdish. This research provides a dataset, three optimized models, and insights into idiom detection, laying a foundation for advancing Kurdish NLP.
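The abstract names a BiLSTM model with an attention mechanism but does not say how the attention is applied. As a rough illustration only, the sketch below shows one common form, dot-product attention pooling: each time step's hidden state is scored against a context vector, the scores are softmax-normalized, and the weighted sum serves as the sentence vector a classifier would consume. The function names, the context vector, and the toy numbers are assumptions made for this example and are not taken from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states: List[Vector], context: Vector) -> Tuple[Vector, List[float]]:
    """Pool per-token hidden states into one sentence vector.

    hidden_states: one vector per token (e.g. BiLSTM outputs); hypothetical input.
    context: a context vector that would be learned in a real model; hypothetical here.
    Returns the pooled vector and the attention weight of each token.
    """
    # Score each token by its dot product with the context vector.
    scores = [sum(h_d * c_d for h_d, c_d in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    # Weighted sum of hidden states, computed one dimension at a time.
    dim = len(context)
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights


# Toy example: three tokens with 2-dimensional hidden states (made-up values).
hidden = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2]]
context = [1.0, 1.0]
pooled, weights = attention_pool(hidden, context)
print("attention weights:", weights)
print("sentence vector:", pooled)
```

In a sentence containing an idiom, a trained model of this kind would be expected to place more weight on the tokens that form the idiomatic expression, although the paper's abstract does not report attention weights.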
Related papers
- NER-RoBERTa: Fine-Tuning RoBERTa for Named Entity Recognition (NER) within low-resource languages [3.5403652483328223]
This work proposes a methodology for fine-tuning the pre-trained RoBERTa model for Kurdish NER (KNER).
Experiments show that fine-tuned RoBERTa with the SentencePiece tokenization method substantially improves KNER performance.
arXiv Detail & Related papers (2024-12-15T07:07:17Z)
- End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach [1.3689715712707342]
This paper introduces a study exploring the effectiveness of Whisper, a pre-trained ASR model, for Northern Kurdish (Kurmanji), an under-resourced language spoken in the Middle East.
Using a Northern Kurdish fine-tuning speech corpus containing approximately 68 hours of validated transcribed data, our experiments demonstrate that the additional module fine-tuning strategy significantly improves ASR accuracy.
arXiv Detail & Related papers (2024-10-19T11:46:30Z)
- Shifting from endangerment to rebirth in the Artificial Intelligence Age: An Ensemble Machine Learning Approach for Hawrami Text Classification [1.174020933567308]
Hawrami, a dialect of Kurdish, is classified as an endangered language.
This paper introduces various text classification models using a dataset of 6,854 articles in Hawrami labeled into 15 categories by two native speakers.
arXiv Detail & Related papers (2024-09-25T12:52:21Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Towards Faster k-Nearest-Neighbor Machine Translation [51.866464707284635]
k-nearest-neighbor machine translation approaches suffer from heavy retrieve overhead on the entire datastore when decoding each token.
We propose a simple yet effective multi-layer perceptron (MLP) network to predict whether a token should be translated jointly by the neural machine translation model and probabilities produced by the kNN.
Our method significantly reduces the overhead of kNN retrievals by up to 53% at the expense of a slight decline in translation quality.
arXiv Detail & Related papers (2023-12-12T16:41:29Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Convolutional Neural Networks for Sentiment Analysis on Weibo Data: A Natural Language Processing Approach [0.228438857884398]
This study addresses the complex task of sentiment analysis on a dataset of 119,988 original tweets from Weibo using a Convolutional Neural Network (CNN).
A CNN-based model was utilized, leveraging word embeddings for feature extraction, and trained to perform sentiment classification.
The model achieved a macro-average F1-score of approximately 0.73 on the test set, showing balanced performance across positive, neutral, and negative sentiments.
arXiv Detail & Related papers (2023-07-13T03:02:56Z)
- Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within data through its natural language counterpart, thus achieving disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- An Attention Ensemble Approach for Efficient Text Classification of Indian Languages [0.0]
This paper focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language.
A hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification.
Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875.
arXiv Detail & Related papers (2021-02-20T07:31:38Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.