Benchmark Performance of Machine And Deep Learning Based Methodologies
for Urdu Text Document Classification
- URL: http://arxiv.org/abs/2003.01345v1
- Date: Tue, 3 Mar 2020 05:49:55 GMT
- Authors: Muhammad Nabeel Asim, Muhammad Usman Ghani, Muhammad Ali Ibrahim,
Sheraz Ahmad, Waqar Mahmood, Andreas Dengel
- Abstract summary: This paper provides benchmark performance for Urdu text document classification.
It investigates the performance impact of traditional machine learning based Urdu text document classification methodologies.
For the very first time, it assesses the performance of various deep learning based methodologies for Urdu text document classification.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In order to provide benchmark performance for Urdu text document
classification, the contribution of this paper is manifold. First, it provides
a publicly available benchmark dataset manually tagged against 6 classes.
Second, it investigates the performance impact of traditional machine learning
based Urdu text document classification methodologies by embedding 10
filter-based feature selection algorithms which have been widely used for other
languages. Third, for the very first time, it assesses the performance of
various deep learning based methodologies for Urdu text document
classification. In this regard, for experimentation, we adapt 10 deep learning
classification methodologies which have produced the best performance figures for
English text classification. Fourth, it also investigates the performance
impact of transfer learning by utilizing the Bidirectional Encoder Representations
from Transformers approach for Urdu language. Fifth, it evaluates the integrity
of a hybrid approach which combines traditional machine learning based feature
engineering and deep learning based automated feature engineering. Experimental
results show that the feature selection approach named Normalised Difference
Measure, combined with a Support Vector Machine, surpasses state-of-the-art
performance on two closed-source benchmark datasets, CLE Urdu Digest 1000k and
CLE Urdu Digest 1Million, by significant margins of 32% and 13%,
respectively. Across all three datasets, Normalised Difference Measure
outperforms other filter-based feature selection algorithms, as it significantly
uplifts the performance of all adopted machine learning, deep learning, and
hybrid approaches. The source code and the presented dataset are available in a
GitHub repository.
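The headline result, ranking terms with the Normalised Difference Measure before feeding a classifier, can be sketched in a few lines. The following is a minimal illustration assuming the commonly cited formulation NDM = |tpr − fpr| / min(tpr, fpr); it is not the authors' released code, and `ndm_score`, the `eps` guard, and the toy term counts are all illustrative.

```python
def ndm_score(tp, fp, pos, neg, eps=1e-6):
    """Score one term for a binary class split.

    tp / fp: positive- / negative-class documents containing the term;
    pos / neg: total documents in each class.  Assumed formulation:
    NDM = |tpr - fpr| / min(tpr, fpr), with eps guarding against
    division by zero for terms absent from one class.
    """
    tpr = tp / pos
    fpr = fp / neg
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)

# Toy document-frequency counts for three terms over a 100/100 class split.
counts = {
    "siyasat": (90, 5),   # concentrated in the positive class
    "khabar":  (50, 50),  # spread evenly across classes -> uninformative
    "mulk":    (10, 70),  # concentrated in the negative class
}

# Rank terms by NDM; in the paper's pipeline, the top-ranked features
# would then be passed to a classifier such as an SVM.
ranked = sorted(counts, key=lambda t: ndm_score(*counts[t], 100, 100),
                reverse=True)
# ranked == ["siyasat", "mulk", "khabar"]
```

For a multiclass task such as the 6-class Urdu dataset, this scoring would typically be applied one-vs-rest per class and the per-class scores aggregated (e.g. by maximum) before selecting the top-k features.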
Related papers
- Stress Detection on Code-Mixed Texts in Dravidian Languages using Machine Learning [0.0]
Stress is a common feeling in daily life, but it can affect mental well-being in some situations.
This study introduces a methodical approach to stress identification in code-mixed texts in Dravidian languages.
arXiv Detail & Related papers (2024-10-08T23:49:31Z)
- Feature Extraction Using Deep Generative Models for Bangla Text Classification on a New Comprehensive Dataset [0.0]
Despite being the sixth most widely spoken language in the world, Bangla has received little attention due to the scarcity of text datasets.
We collected, annotated, and prepared a comprehensive dataset of 212,184 Bangla documents in seven different categories and made it publicly accessible.
arXiv Detail & Related papers (2023-08-21T22:18:09Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis [0.0]
This study aims to establish an automated classification system for Pashto text.
The study achieved an average testing accuracy rate of 94%.
The use of pre-trained language representation models, such as DistilBERT, showed promising results.
arXiv Detail & Related papers (2023-05-04T23:11:31Z)
- Zero-Shot Listwise Document Reranking with a Large Language Model [58.64141622176841]
We propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data.
Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as a final-stage reranker.
arXiv Detail & Related papers (2023-05-03T14:45:34Z)
- Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset.
The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z)
- UniTE: Unified Translation Evaluation [63.58868113074476]
UniTE is the first unified framework capable of handling all three evaluation tasks.
We test our framework on the WMT 2019 Metrics and WMT 2020 Quality Estimation benchmarks.
arXiv Detail & Related papers (2022-04-28T08:35:26Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- An Attention Ensemble Approach for Efficient Text Classification of Indian Languages [0.0]
This paper focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language.
A hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification.
Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875.
arXiv Detail & Related papers (2021-02-20T07:31:38Z)
- Hierarchical Bi-Directional Self-Attention Networks for Paper Review Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: a sentence encoder (level one), an intra-review encoder (level two), and an inter-review encoder (level three).
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
arXiv Detail & Related papers (2020-11-02T08:07:50Z)
- A Precisely Xtreme-Multi Channel Hybrid Approach For Roman Urdu Sentiment Analysis [0.8812173669205371]
This paper provides three neural word embeddings prepared using the most widely used approaches, namely Word2vec, FastText, and GloVe.
Considering the lack of publicly available benchmark datasets, it provides a first-ever Roman Urdu dataset which consists of 3241 sentiments annotated against positive, negative and neutral classes.
It proposes a novel precisely extreme multi-channel hybrid methodology which outperforms state-of-the-art adapted machine and deep learning approaches by 9% and 4%, respectively, in terms of F1-score.
arXiv Detail & Related papers (2020-03-11T04:08:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.