Related papers: A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream

URL: http://arxiv.org/abs/2403.10237v1
Date: Fri, 15 Mar 2024 12:08:58 GMT
Title: A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream
Authors: Elnaz Zafarani-Moattar, Mohammad Reza Kangavari, Amir Masoud Rahmani
Abstract summary: The aim of this study is to conduct an extensive study on the best algorithms for topic detection. The text of Persian social network posts is used as the dataset. The results indicate that if we are searching for keyword-topics that are easily understandable by humans, the hybrid category is better.
Score: 6.446062819763263
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Topic detection is a complex process and depends on language because it somehow needs to analyze text. There have been few studies on topic detection in Persian, and the existing algorithms are not remarkable. Therefore, we aimed to study topic detection in Persian. The objectives of this study are: 1) to conduct an extensive study on the best algorithms for topic detection, 2) to identify necessary adaptations to make these algorithms suitable for the Persian language, and 3) to evaluate their performance on Persian social network texts. To achieve these objectives, we have formulated two research questions: First, considering the lack of research in Persian, what modifications should be made to existing frameworks, especially those developed in English, to make them compatible with Persian? Second, how do these algorithms perform, and which one is superior? There are various topic detection methods that can be categorized into different categories. Frequent pattern and clustering are selected for this research, and a hybrid of both is proposed as a new category. Then, ten methods from these three categories are selected. All of them are re-implemented from scratch, changed, and adapted with Persian. These ten methods encompass different types of topic detection methods and have shown good performance in English. The text of Persian social network posts is used as the dataset. Additionally, a new multiclass evaluation criterion, called FS, is used in this paper for the first time in the field of topic detection. Approximately 1.4 billion tokens are processed during experiments. The results indicate that if we are searching for keyword-topics that are easily understandable by humans, the hybrid category is better. However, if the aim is to cluster posts for further analysis, the frequent pattern category is more suitable.

Related papers

Category Discovery: An Open-World Perspective [17.624912732260672]
Category discovery (CD) is an emerging open-world learning task, which aims at automatically categorizing unlabelled data. We provide a comprehensive review of the literature, and offer detailed analysis and in-depth discussion on different methods. We distill key insights showing that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training are all beneficial for category discovery.
arXiv Detail & Related papers (2025-09-26T16:19:05Z)
Regularization-Based Methods for Ordinal Quantification [49.606912965922504]
We study the ordinal case, i.e., the case in which a total order is defined on the set of n>2 classes. We propose a novel class of regularized OQ algorithms, which outperforms existing algorithms in our experiments.
arXiv Detail & Related papers (2023-10-13T16:04:06Z)
Tuning Traditional Language Processing Approaches for Pashto Text Classification [0.0]
The main aim of this study is to establish a Pashto automatic text classification system. This study compares several models containing both statistical and neural network machine learning techniques. This research obtained an average testing accuracy of 94% using a classification algorithm and the TFIDF feature extraction method.
arXiv Detail & Related papers (2023-05-04T22:57:45Z)
A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper. Our dataset consists of 477 self-reported expertise scores provided by 58 researchers. For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
Persian topic detection based on Human Word association and graph embedding [3.8137985834223507]
We propose a framework to detect topics in social media based on Human Word Association. Most of the work done in this area is in English, but not much has been done in the Persian language.
arXiv Detail & Related papers (2023-02-20T05:46:47Z)
Toward the Understanding of Deep Text Matching Models for Information Retrieval [72.72380690535766]
This paper aims at testing whether existing deep text matching methods satisfy some fundamental constraints in information retrieval. Specifically, four attributions are used in our study, i.e., term frequency constraint, term discrimination constraint, length normalization constraints, and TF-length constraint. Experimental results on LETOR 4.0 and MS Marco show that all the investigated deep text matching methods satisfy the above constraints with high probabilities in statistics.
arXiv Detail & Related papers (2021-08-16T13:33:15Z)
The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach [0.0]
This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content. The presented approach uses machine-translated datasets to conduct sentiment analysis for the Persian language. The results of the experiments have shown promising state-of-the-art performance in contrast to the previous efforts.
arXiv Detail & Related papers (2021-01-20T11:57:59Z)
Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task. Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words. We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis. We learn <sentiment, aspect> joint topic embeddings in the word embedding space. We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
Polysemy Deciphering Network for Robust Human-Object Interaction Detection [86.97181280842098]
We propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection. We refine features for HOI detection to be polysemy-aware through the use of two novel modules. Second, we introduce a novel Polysemy-Aware Modal Fusion module (PAMF) which guides PD-Net to make decisions based on feature types deemed more important according to the language priors.
arXiv Detail & Related papers (2020-08-07T00:49:27Z)
A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning. This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021. We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z)
A novel approach to sentiment analysis in Persian using discourse and external semantic information [0.0]
Many approaches have been proposed to extract the sentiment of individuals from documents written in natural languages. The majority of these approaches have focused on English, while resource-lean languages such as Persian suffer from the lack of research work and language resources. Due to this gap in Persian, the current work is accomplished to introduce new methods for sentiment analysis which have been applied on Persian.
arXiv Detail & Related papers (2020-07-18T18:40:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.