Web Document Categorization Using Naive Bayes Classifier and Latent
Semantic Analysis
- URL: http://arxiv.org/abs/2006.01715v1
- Date: Tue, 2 Jun 2020 15:35:05 GMT
- Title: Web Document Categorization Using Naive Bayes Classifier and Latent
Semantic Analysis
- Authors: Alireza Saleh Sedghpour, Mohammad Reza Saleh Sedghpour
- Abstract summary: The rapid growth of web documents necessitates efficient techniques for classifying documents on the web.
We propose a method for web document classification that uses latent semantic analysis (LSA) to increase the similarity of documents in the same class and improve classification precision.
Experimental results show that this preprocessing improves the accuracy and speed of Naive Bayes.
- Score: 0.7310043452300736
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The rapid growth of web documents, driven by heavy use of the World
Wide Web, necessitates efficient techniques for classifying documents on the
web. High volumes of highly diverse data are produced every second, and
automatically classifying this growing number of web documents is one of the
biggest challenges facing us today. Probabilistic classification algorithms
such as Naive Bayes have become commonly used for web document classification.
Their popularity is mainly due to their relatively high classification accuracy
in many application areas, but they offer little support for the
high-dimensional, sparse data that is characteristic of textual data
representation. Traditional feature selection methods also commonly neglect the
semantic relations between words when dealing with big data and large-scale web
documents. To address these problems, we propose a method for web document
classification that uses latent semantic analysis (LSA) to increase the
similarity of documents in the same class and improve classification precision.
Using this approach, we designed a faster and much more accurate classifier for
web documents. Experimental results show that this preprocessing improves the
accuracy and speed of Naive Bayes, and the precision and recall metrics confirm
the improvement.
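The pipeline described in the abstract (project documents into an LSA space, then classify with Naive Bayes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes NumPy, raw term counts, a truncated SVD with k = 2, and a Gaussian Naive Bayes variant, chosen here because LSA features are real-valued and can be negative. The toy corpus, the function and class names, and the variance-smoothing constant are all hypothetical.

```python
import numpy as np


def build_term_matrix(docs, vocab=None):
    """Return (count matrix of shape [n_docs, n_terms], vocabulary list)."""
    tokenized = [d.lower().split() for d in docs]
    if vocab is None:
        vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, toks in enumerate(tokenized):
        for t in toks:
            if t in index:  # terms unseen during training are ignored
                X[row, index[t]] += 1.0
    return X, vocab


def fit_lsa(X, k):
    """LSA via truncated SVD: keep the top-k right singular vectors."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T  # projection matrix of shape [n_terms, k]


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per class and feature (an assumed variant)."""

    def fit(self, Z, y):
        self.classes = sorted(set(y))
        y = np.array(y)
        self.mean = np.array([Z[y == c].mean(axis=0) for c in self.classes])
        # Small constant (hypothetical choice) keeps variances strictly positive.
        self.var = np.array([Z[y == c].var(axis=0) for c in self.classes]) + 1e-3
        self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
        return self

    def predict(self, Z):
        # Score = log p(c) + sum_j log N(z_j | mean_cj, var_cj)
        diff = Z[:, None, :] - self.mean[None, :, :]
        log_lik = -0.5 * (
            np.log(2 * np.pi * self.var)[None] + diff ** 2 / self.var[None]
        ).sum(axis=2)
        scores = self.log_prior[None, :] + log_lik
        return [self.classes[i] for i in scores.argmax(axis=1)]


# Toy corpus (hypothetical): two topics with disjoint vocabularies.
train_docs = [
    "football match goal team",
    "team wins football league",
    "goal scored in the match",
    "stock market shares fall",
    "market investors buy shares",
    "stock prices and investors",
]
train_labels = ["sport", "sport", "sport", "finance", "finance", "finance"]
test_docs = ["football team goal", "investors stock market"]

X_train, vocab = build_term_matrix(train_docs)
V = fit_lsa(X_train, k=2)                      # learn the LSA projection
clf = GaussianNaiveBayes().fit(X_train @ V, train_labels)

X_test, _ = build_term_matrix(test_docs, vocab)
pred = clf.predict(X_test @ V)                 # classify in the LSA space
print(pred)
```

The projection reduces each document from one dimension per vocabulary term to k dense dimensions, which is where any speed gain for the classifier would come from. In practice k, the term weighting (for example TF-IDF instead of raw counts), and the Naive Bayes variant would all need to be tuned on real data.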
Related papers
- Document Type Classification using File Names [7.130525292849283]
Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification.
Traditional approaches that rely on heavy-duty deep learning models fall short due to high inference times over vast input datasets.
We present a method using lightweight supervised learning models, combined with a TF-IDF feature extraction-based tokenization method.
arXiv Detail & Related papers (2024-10-02T01:42:19Z)
- Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data [0.0]
We explore different methods for detecting adult and harmful content in multilingual heterogeneous web data.
We train solely with adult and harmful textual data, and then select the documents having a perplexity value above a given threshold.
This approach will virtually cluster our documents into two distinct groups, which will greatly facilitate the choice of the threshold for the perplexity.
arXiv Detail & Related papers (2022-12-20T17:14:45Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Efficient Classification of Long Documents Using Transformers [13.927622630633344]
We evaluate the relative efficacy measured against various baselines and diverse datasets.
Results show that more complex models often fail to outperform simple baselines and yield inconsistent performance across datasets.
arXiv Detail & Related papers (2022-03-21T18:36:18Z)
- Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation [49.940525611640346]
The Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their interpolations and perturbations.
We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- An Efficient and Accurate Rough Set for Feature Selection, Classification and Knowledge Representation [89.5951484413208]
This paper presents a strong data mining method based on rough sets, which can realize feature selection, classification and knowledge representation at the same time.
We first find that rough sets are ineffective because of overfitting, especially when processing noisy attributes, and propose a robust measurement for an attribute, called relative importance.
Experimental results on public benchmark data sets show that the proposed framework achieves higher accuracy than seven popular or state-of-the-art feature selection methods.
arXiv Detail & Related papers (2021-12-29T12:45:49Z)
- Comparative Study of Long Document Classification [0.0]
We revisit long document classification using standard machine learning approaches.
We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.
arXiv Detail & Related papers (2021-11-01T04:51:51Z)
- Conical Classification For Computationally Efficient One-Class Topic Determination [0.0]
We propose a Conical classification approach to identify documents that relate to a particular topic.
We show in our analysis that our approach has higher predictive power on our datasets, and is also faster to compute.
arXiv Detail & Related papers (2021-10-31T01:27:12Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.