Related papers: Web Document Categorization Using Naive Bayes Classifier and Latent Semantic Analysis

URL: http://arxiv.org/abs/2006.01715v1
Date: Tue, 2 Jun 2020 15:35:05 GMT
Title: Web Document Categorization Using Naive Bayes Classifier and Latent Semantic Analysis
Authors: Alireza Saleh Sedghpour, Mohammad Reza Saleh Sedghpour
Abstract summary: The rapid growth of web documents calls for efficient techniques to classify documents on the web. We propose a method for web document classification that uses Latent Semantic Analysis (LSA) to increase the similarity of documents in the same class and improve classification precision. Experimental results show that this preprocessing improves the accuracy and speed of Naive Bayes.
Score: 0.7310043452300736
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The rapid growth of web documents, driven by heavy use of the World Wide Web, calls for efficient techniques to classify documents on the web. Large volumes of highly diverse data are produced every second, and automatically classifying this growing body of web documents is one of the biggest challenges today. Probabilistic classification algorithms such as Naive Bayes are commonly used for web document classification, mainly because of their relatively high classification accuracy in many application areas. However, they lack support for high-dimensional, sparse data, which is characteristic of textual data representations. Traditional feature selection methods also tend to ignore the semantic relations between words when dealing with big data and large-scale web documents. To address these problems, we propose a method for web document classification that uses Latent Semantic Analysis (LSA) to increase the similarity of documents in the same class and improve classification precision. Using this approach, we designed a faster and more accurate classifier for web documents. Experimental results show that this preprocessing improves the accuracy and speed of Naive Bayes, and the precision and recall metrics confirm the improvement.
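The abstract does not specify the term weighting, the number of latent dimensions, or the Naive Bayes variant, so the following is only a minimal sketch of the general idea (project bag-of-words vectors into an LSA space, then classify with Naive Bayes), not the authors' implementation. The toy corpus, the function names, the choice of k=2, and the smoothing constant are all assumptions made for illustration. A Gaussian Naive Bayes is used here because LSA coordinates can be negative, which a count-based multinomial model does not accept.

```python
import math
import random
from collections import defaultdict

# Toy labelled corpus (hypothetical; not the paper's data).
TRAIN = [
    ("football match goal team", "sports"),
    ("team wins football league", "sports"),
    ("goal scored match league", "sports"),
    ("election vote parliament government", "politics"),
    ("government passes vote law", "politics"),
    ("parliament election law debate", "politics"),
]

def build_vocab(texts):
    """Map each distinct token to a column index."""
    vocab = {}
    for text in texts:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab):
    """Bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    vec = [0.0] * len(vocab)
    for tok in text.split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lsa_basis(X, k, iters=200, seed=0):
    """Top-k right singular vectors of X via power iteration with deflation."""
    rng = random.Random(seed)
    n_terms = len(X[0])
    basis = []
    for _ in range(k):
        v = [rng.random() for _ in range(n_terms)]
        for _ in range(iters):
            # Remove components along the vectors already found.
            for b in basis:
                proj = dot(v, b)
                v = [x - proj * y for x, y in zip(v, b)]
            # Multiply by X^T X without forming it explicitly.
            scores = [dot(row, v) for row in X]
            v = [sum(s * row[j] for s, row in zip(scores, X))
                 for j in range(n_terms)]
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        basis.append(v)
    return basis

def project(vec, basis):
    """Coordinates of a term vector in the latent (LSA) space."""
    return [dot(vec, b) for b in basis]

def fit_gaussian_nb(Z, labels, smoothing=1e-3):
    """Per-class log prior, mean, and smoothed variance per latent dimension."""
    by_class = defaultdict(list)
    for z, y in zip(Z, labels):
        by_class[y].append(z)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((x - m) ** 2 for x in col) / n + smoothing
                     for col, m in zip(zip(*rows), means)]
        model[y] = (math.log(n / len(Z)), means, variances)
    return model

def predict(z, model):
    """Return the class with the highest Gaussian log posterior."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(z, means, variances))
    return max(model, key=lambda y: log_posterior(model[y]))

texts, labels = zip(*TRAIN)
vocab = build_vocab(texts)
X = [vectorize(t, vocab) for t in texts]
basis = lsa_basis(X, k=2)
model = fit_gaussian_nb([project(x, basis) for x in X], labels)

def classify(text):
    return predict(project(vectorize(text, vocab), basis), model)

print(classify("football goal team"))
print(classify("election vote government"))
```

In this sketch the classifier only ever sees k latent coordinates per document instead of one feature per vocabulary term, which is one plausible reading of how the LSA preprocessing could reduce both sparsity and classification cost.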

Related papers

DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification [5.247930659596986]
We introduce generative document counterfactuals that provide meaningful insights into the model's decision-making through actionable explanations. To the best of the authors' knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.
arXiv Detail & Related papers (2025-08-06T09:15:32Z)
Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
Document Type Classification using File Names [7.130525292849283]
Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification. Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets. We present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method.
arXiv Detail & Related papers (2024-10-02T01:42:19Z)
Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data [0.0]
We explore different methods for detecting adult and harmful content in multilingual heterogeneous web data. We train solely with adult and harmful textual data, and then select the documents having a perplexity value above a given threshold. This approach will virtually cluster our documents into two distinct groups, which will greatly facilitate the choice of the threshold for the perplexity.
arXiv Detail & Related papers (2022-12-20T17:14:45Z)
Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions. Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
Efficient Classification of Long Documents Using Transformers [13.927622630633344]
We evaluate the relative efficacy of Transformer-based methods for long document classification, measured against various baselines and diverse datasets. Results show that more complex models often fail to outperform simple baselines and yield inconsistent performance across datasets.
arXiv Detail & Related papers (2022-03-21T18:36:18Z)
Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their interpolation and perturbation. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects. Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency. We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
An Efficient and Accurate Rough Set for Feature Selection, Classification and Knowledge Representation [89.5951484413208]
This paper presents a strong data mining method based on rough sets, which can realize feature selection, classification and knowledge representation at the same time. We first identify the ineffectiveness of rough sets caused by overfitting, especially when processing noisy attributes, and propose a robust measurement for an attribute, called relative importance. Experimental results on public benchmark data sets show that the proposed framework achieves higher accuracy than seven popular or state-of-the-art feature selection methods.
arXiv Detail & Related papers (2021-12-29T12:45:49Z)
Comparative Study of Long Document Classification [0.0]
We revisit long document classification using standard machine learning approaches. We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.
arXiv Detail & Related papers (2021-11-01T04:51:51Z)
Conical Classification For Computationally Efficient One-Class Topic Determination [0.0]
We propose a Conical classification approach to identify documents that relate to a particular topic. We show in our analysis that our approach has higher predictive power on our datasets, and is also faster to compute.
arXiv Detail & Related papers (2021-10-31T01:27:12Z)
Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated promising results on text classification. Despite this success, their performance could be largely jeopardized in practice since they are unable to capture high-order interactions between words. We propose a principled model, hypergraph attention networks (HyperGAT), which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.