Text classification dataset and analysis for Uzbek language
- URL: http://arxiv.org/abs/2302.14494v1
- Date: Tue, 28 Feb 2023 11:21:24 GMT
- Title: Text classification dataset and analysis for Uzbek language
- Authors: Elmurod Kuriyozov, Ulugbek Salaev, Sanatbek Matlatipov, Gayrat
Matlatipov
- Abstract summary: We first present a newly obtained dataset for Uzbek text classification, which was collected from 10 different news and press websites.
We also present a comprehensive evaluation of different models, ranging from traditional bag-of-words models to deep learning architectures.
Our experiments show that the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text classification is an important task in Natural Language Processing
(NLP), where the goal is to categorize text data into predefined classes. In
this study, we analyse the dataset creation steps and evaluation techniques
for a multi-label news categorisation task as part of text classification. We
first
present a newly obtained dataset for Uzbek text classification, which was
collected from 10 different news and press websites and covers 15 categories of
news, press and law texts. We also present a comprehensive evaluation of
different models, ranging from traditional bag-of-words models to deep learning
architectures, on this newly created dataset. Our experiments show that the
Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based
models outperform the rule-based models. The best performance is achieved by
the BERTbek model, which is a transformer-based BERT model trained on the Uzbek
corpus. Our findings provide a good baseline for further research in Uzbek text
classification.
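As an illustration of the bag-of-words end of the model spectrum evaluated above, the following is a minimal, self-contained sketch of a count-based nearest-centroid news classifier. It is not the paper's implementation; the category labels and example texts are hypothetical placeholders.

```python
# Minimal bag-of-words baseline sketch (illustrative only, not the
# paper's implementation). Classes and texts are hypothetical.
from collections import Counter
import math

def bow(text):
    """Lowercased word-count (bag-of-words) vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_centroids(examples):
    """Sum bag-of-words vectors per class to form class centroids."""
    centroids = {}
    for label, text in examples:
        centroids.setdefault(label, Counter()).update(bow(text))
    return centroids

def classify(centroids, text):
    """Assign the label whose centroid is most similar to the text."""
    v = bow(text)
    return max(centroids, key=lambda label: cosine(centroids[label], v))

# Toy training set with two hypothetical categories.
train = [
    ("sport", "the team won the football match"),
    ("sport", "players scored in the final match"),
    ("law", "the court ruled on the new law"),
    ("law", "parliament passed the law yesterday"),
]
centroids = train_centroids(train)
print(classify(centroids, "the match ended with a late goal"))  # sport
```

A real baseline along these lines would typically use TF-IDF weighting and a trained linear classifier rather than raw counts and centroids, but the feature representation is the same idea.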
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Benchmarking Multilabel Topic Classification in the Kyrgyz Language [6.15353988889181]
We present a new public benchmark for topic classification in Kyrgyz based on collected and annotated data from the news site 24.KG.
We train and evaluate both classical statistical and neural models, reporting the scores, discussing the results, and proposing directions for future work.
arXiv Detail & Related papers (2023-08-30T11:02:26Z)
- A Dataset and Strong Baselines for Classification of Czech News Texts [0.0]
We present the Czech News Classification dataset (CZE-NEC), one of the largest Czech classification datasets.
We define four classification tasks: news source, news category, inferred author's gender, and day of the week.
We show that language-specific pre-trained encoder models outperform selected commercially available large-scale generative language models.
arXiv Detail & Related papers (2023-07-20T07:47:08Z)
- SLCNN: Sentence-Level Convolutional Neural Network for Text Classification [0.0]
Convolutional neural network (CNN) has shown remarkable success in the task of text classification.
New CNN-based baseline models are studied for text classification.
Results show that the proposed models perform better, particularly on longer documents.
arXiv Detail & Related papers (2023-01-27T13:16:02Z)
- A semantic hierarchical graph neural network for text classification [1.439766998338892]
We propose a new hierarchical graph neural network (HieGNN) which extracts information at the word, sentence, and document levels, respectively.
Experiments on several benchmark datasets show better or comparable results relative to several baseline methods.
arXiv Detail & Related papers (2022-09-15T03:59:31Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated data and original data to reinforce its capability of integrating information on the graph.
arXiv Detail & Related papers (2021-05-06T12:20:41Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Deep Learning Based Text Classification: A Comprehensive Review [75.8403533775179]
We provide a review of more than 150 deep learning based models for text classification developed in recent years.
We also provide a summary of more than 40 popular datasets widely used for text classification.
arXiv Detail & Related papers (2020-04-06T02:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.