Machine and Deep Learning Methods with Manual and Automatic Labelling
for News Classification in Bangla Language
- URL: http://arxiv.org/abs/2210.10903v1
- Date: Wed, 19 Oct 2022 21:53:49 GMT
- Title: Machine and Deep Learning Methods with Manual and Automatic Labelling
for News Classification in Bangla Language
- Authors: Istiak Ahmad, Fahad AlQurashi, Rashid Mehmood
- Abstract summary: This paper introduces several machine and deep learning methods with manual and automatic labelling for news classification in the Bangla language.
We implement several machine (ML) and deep learning (DL) algorithms. The ML algorithms are Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbour (KNN).
We develop automatic labelling methods using Latent Dirichlet Allocation (LDA) and investigate the performance of single-label and multi-label article classification methods.
- Score: 0.36832029288386137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research in Natural Language Processing (NLP) has increasingly become
important due to applications such as text classification, text mining,
sentiment analysis, POS tagging, named entity recognition, textual entailment,
and many others. This paper introduces several machine and deep learning
methods with manual and automatic labelling for news classification in the
Bangla language. We implemented several machine (ML) and deep learning (DL)
algorithms. The ML algorithms are Logistic Regression (LR), Stochastic Gradient
Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), and K-Nearest
Neighbour (KNN), used with Bag of Words (BoW), Term Frequency-Inverse Document
Frequency (TF-IDF), and Doc2Vec embedding models. The DL algorithms are Long
Short-Term Memory (LSTM), Bidirectional LSTM (BiLSTM), Gated Recurrent Unit
(GRU), and Convolutional Neural Network (CNN), used with Word2Vec, GloVe, and
FastText word embedding models. We develop automatic labelling methods using
Latent Dirichlet Allocation (LDA) and investigate the performance of
single-label and multi-label article classification methods. To investigate
performance, we developed from scratch Potrika, the largest and the most
extensive dataset for news classification in the Bangla language, comprising
185.51 million words and 12.57 million sentences contained in 664,880 news
articles in eight distinct categories, curated from six popular online news
portals in Bangladesh for the period 2014-2020. GRU with FastText achieves the
highest accuracy (91.83%) for manually-labelled data. For the automatic
labelling case, KNN with Doc2Vec achieves the highest accuracy for single-label
(57.72%) and multi-label (75%) data. The methods developed in
this paper are expected to advance research in Bangla and other languages.
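As a rough illustration of the automatic-labelling pipeline described in the abstract, the sketch below uses scikit-learn to assign LDA topic ids as pseudo-labels and then trains a KNN classifier on them. It is a minimal sketch under stated assumptions: the toy English corpus stands in for Bangla articles, TF-IDF stands in for the paper's Doc2Vec embeddings, and all hyperparameters are illustrative rather than the authors' actual setup.

```python
# Hypothetical sketch of LDA-based automatic labelling followed by KNN
# classification. Corpus, topic count, and parameters are illustrative.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

articles = [
    "the team won the cricket match by five wickets",
    "the striker scored twice in the football final",
    "parliament passed the new budget bill today",
    "the minister announced an election date",
    "share prices rose after the bank cut rates",
    "the stock market closed higher on strong earnings",
]

# Step 1: automatic labelling. Fit LDA on raw term counts and take each
# article's dominant topic id as its pseudo-label.
counts = CountVectorizer().fit_transform(articles)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
pseudo_labels = lda.fit_transform(counts).argmax(axis=1)

# Step 2: train a classifier on the pseudo-labels (TF-IDF features here,
# where the paper uses Doc2Vec vectors).
clf = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
clf.fit(articles, pseudo_labels)

pred = clf.predict(["the batsman hit a century in the cricket cup"])
print(int(pred[0]))
```

In the paper itself the pseudo-labels come from LDA over the 664,880-article Potrika corpus and the best-performing classifier operates on Doc2Vec vectors; the toy substitution here only shows how the two stages connect.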
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Description-Enhanced Label Embedding Contrastive Learning for Text Classification [65.01077813330559]
This paper incorporates Self-Supervised Learning (SSL) into the model learning process and designs a novel self-supervised Relation of Relation (R2) classification task.
It proposes a Relation of Relation Learning Network (R2-Net) for text classification, in which text classification and R2 classification are treated as optimization targets.
It also exploits external knowledge from WordNet to obtain multi-aspect descriptions for label semantic learning.
arXiv Detail & Related papers (2023-06-15T02:19:34Z)
- Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis [0.0]
This study aims to establish an automated classification system for Pashto text.
The study achieved an average testing accuracy rate of 94%.
The use of pre-trained language representation models, such as DistilBERT, showed promising results.
arXiv Detail & Related papers (2023-05-04T23:11:31Z)
- Tuning Traditional Language Processing Approaches for Pashto Text Classification [0.0]
The main aim of this study is to establish a Pashto automatic text classification system.
This study compares several models containing both statistical and neural network machine learning techniques.
This research obtained an average testing accuracy of 94% using a classification algorithm with the TF-IDF feature extraction method.
arXiv Detail & Related papers (2023-05-04T22:57:45Z)
- Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning-Based Approaches [0.0]
This research work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) to develop an Odia part-of-speech tagger.
It has been observed that the Bi-LSTM model with character sequence features and pre-trained word vectors achieved a significant state-of-the-art result.
arXiv Detail & Related papers (2022-07-07T12:15:23Z)
- Trustable Co-label Learning from Multiple Noisy Annotators [68.59187658490804]
Supervised deep learning depends on massive accurately annotated examples.
A typical alternative is learning from multiple noisy annotators.
This paper proposes a data-efficient approach, called Trustable Co-label Learning (TCL).
arXiv Detail & Related papers (2022-03-08T16:57:00Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and are proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Hierarchical Text Classification of Urdu News using Deep Neural Network [0.0]
This paper proposes a deep learning model for hierarchical text classification of news in Urdu language.
It consists of 51,325 sentences from 8 online news websites, covering genres such as Sports, Technology, and Entertainment.
arXiv Detail & Related papers (2021-07-07T11:06:11Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- An Attention Ensemble Approach for Efficient Text Classification of Indian Languages [0.0]
This paper focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language.
A hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification.
Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875.
arXiv Detail & Related papers (2021-02-20T07:31:38Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.