A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream
- URL: http://arxiv.org/abs/2403.10237v1
- Date: Fri, 15 Mar 2024 12:08:58 GMT
- Title: A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream
- Authors: Elnaz Zafarani-Moattar, Mohammad Reza Kangavari, Amir Masoud Rahmani
- Abstract summary: The aim of this study is to conduct an extensive study on the best algorithms for topic detection.
The text of Persian social network posts is used as the dataset.
The results indicate that if we are searching for keyword-topics that are easily understandable by humans, the hybrid category is better.
- Score: 6.446062819763263
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Topic detection is a complex, language-dependent process because it requires analyzing text. There have been few studies on topic detection in Persian, and the existing algorithms are not remarkable. Therefore, we aimed to study topic detection in Persian. The objectives of this study are: 1) to conduct an extensive study on the best algorithms for topic detection, 2) to identify the adaptations needed to make these algorithms suitable for the Persian language, and 3) to evaluate their performance on Persian social network texts. To achieve these objectives, we formulated two research questions. First, considering the lack of research in Persian, what modifications should be made to existing frameworks, especially those developed for English, to make them compatible with Persian? Second, how do these algorithms perform, and which one is superior? Topic detection methods fall into several categories. Frequent pattern mining and clustering are selected for this research, and a hybrid of the two is proposed as a new category. Ten methods from these three categories are then selected. All of them are re-implemented from scratch, modified, and adapted to Persian. These ten methods cover different types of topic detection and have shown good performance in English. The text of Persian social network posts is used as the dataset. Additionally, a new multiclass evaluation criterion, called FS, is used in this paper for the first time in the field of topic detection. Approximately 1.4 billion tokens are processed during the experiments. The results indicate that if we are searching for keyword-topics that are easily understandable by humans, the hybrid category is better. However, if the aim is to cluster posts for further analysis, the frequent pattern category is more suitable.
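To make the frequent pattern category mentioned in the abstract more concrete, here is a minimal sketch of the general idea: treat each post as a set of terms and report term sets that co-occur in at least a minimum number of posts as candidate keyword-topics. This is not one of the ten methods evaluated in the paper. The function name `frequent_term_sets`, the parameters `min_support` and `max_size`, the whitespace tokenizer, and the English sample posts are all assumptions made for illustration; real Persian text would first need language-specific normalization and tokenization.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(posts, min_support=2, max_size=2):
    """Return {frozenset(terms): support} for term sets found in >= min_support posts.

    Hypothetical helper for illustration only; not the paper's algorithm.
    """
    # Each post becomes a set of tokens (whitespace split is a placeholder tokenizer).
    transactions = [frozenset(post.split()) for post in posts]

    # Count in how many posts each single term appears.
    term_counts = Counter(term for t in transactions for term in t)
    frequent_terms = {term for term, c in term_counts.items() if c >= min_support}

    # Only frequent single terms can be part of a frequent larger set,
    # so infrequent terms are dropped before enumerating combinations.
    counts = Counter()
    for t in transactions:
        kept = sorted(t & frequent_terms)
        for size in range(1, max_size + 1):
            for combo in combinations(kept, size):
                counts[frozenset(combo)] += 1

    return {terms: c for terms, c in counts.items() if c >= min_support}


# Made-up sample posts, in English for readability.
posts = [
    "election results tehran",
    "election results today",
    "football match tehran",
]
topics = frequent_term_sets(posts, min_support=2, max_size=2)
for terms, support in sorted(topics.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(sorted(terms), support)
```

In this sketch, larger frequent term sets play the role of keyword-topics; a clustering method would instead group whole posts, which matches the distinction the abstract draws between human-readable keyword-topics and clusters of posts for further analysis.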
Related papers
- Regularization-Based Methods for Ordinal Quantification [49.606912965922504]
We study the ordinal case, i.e., the case in which a total order is defined on the set of n>2 classes.
We propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments.
arXiv Detail & Related papers (2023-10-13T16:04:06Z)
- Tuning Traditional Language Processing Approaches for Pashto Text Classification [0.0]
The main aim of this study is to establish a Pashto automatic text classification system.
This study compares several models containing both statistical and neural network machine learning techniques.
This research obtained an average testing accuracy of 94% using a classification algorithm with TF-IDF feature extraction.
arXiv Detail & Related papers (2023-05-04T22:57:45Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Persian topic detection based on Human Word association and graph embedding [3.8137985834223507]
We propose a framework to detect topics in social media based on Human Word Association.
Most of the work done in this area is in English, but not much has been done in the Persian language.
arXiv Detail & Related papers (2023-02-20T05:46:47Z)
- Toward the Understanding of Deep Text Matching Models for Information Retrieval [72.72380690535766]
This paper aims at testing whether existing deep text matching methods satisfy some fundamental constraints in information retrieval.
Specifically, four constraints are used in our study, i.e., the term frequency constraint, the term discrimination constraint, the length normalization constraints, and the TF-length constraint.
Experimental results on LETOR 4.0 and MS Marco show that all the investigated deep text matching methods satisfy the above constraints with high probabilities in statistics.
arXiv Detail & Related papers (2021-08-16T13:33:15Z)
- The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach [0.0]
This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content.
The presented approach uses machine-translated datasets to conduct sentiment analysis for the Persian language.
The experimental results show promising state-of-the-art performance compared with previous efforts.
arXiv Detail & Related papers (2021-01-20T11:57:59Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- Polysemy Deciphering Network for Robust Human-Object Interaction Detection [86.97181280842098]
We propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection.
We refine features for HOI detection to be polysemy-aware through the use of two novel modules.
Second, we introduce a novel Polysemy-Aware Modal Fusion module (PAMF) which guides PD-Net to make decisions based on feature types deemed more important according to the language priors.
arXiv Detail & Related papers (2020-08-07T00:49:27Z)
- A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)
- A novel approach to sentiment analysis in Persian using discourse and external semantic information [0.0]
Many approaches have been proposed to extract the sentiment of individuals from documents written in natural languages.
The majority of these approaches have focused on English, while resource-lean languages such as Persian suffer from the lack of research work and language resources.
To address this gap, the current work introduces new methods for sentiment analysis and applies them to Persian.
arXiv Detail & Related papers (2020-07-18T18:40:40Z)