Text Classification Using Hybrid Machine Learning Algorithms on Big Data
- URL: http://arxiv.org/abs/2103.16624v1
- Date: Tue, 30 Mar 2021 19:02:48 GMT
- Title: Text Classification Using Hybrid Machine Learning Algorithms on Big Data
- Authors: D.C. Asogwa, S.O. Anigbogu, I.E. Onyenwe, F.A. Sani
- Abstract summary: In this work, two supervised machine learning algorithms are combined with text mining techniques to produce a hybrid model.
The result shows that the hybrid model gave 96.76% accuracy as against the 61.45% and 69.21% of the Naïve Bayes and SVM models respectively.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been unprecedented data growth originating from different
online platforms which contribute to big data in terms of volume, velocity,
variety and veracity (4Vs). Given this nature of big data which is
unstructured, performing analytics to extract meaningful information is
currently a great challenge to big data analytics. Collecting and analyzing
unstructured textual data allows decision makers to study the escalation of
comments/posts on our social media platforms. Hence, there is a need for
automatic big data analysis to overcome the noise and the non-reliability of
these unstructured datasets from the digital media platforms. However, current
machine learning algorithms are performance driven, focusing on the
classification/prediction accuracy based on known properties learned from the
training samples. With the learning task in a large dataset, most machine
learning models are known to require high computational cost which eventually
leads to computational complexity. In this work, two supervised machine
learning algorithms are combined with text mining techniques to produce a
hybrid model which consists of Naïve Bayes and support vector machines (SVM).
This is to increase the efficiency and accuracy of the results obtained and
also to reduce the computational cost and complexity. The system also provides
an open platform where a group of persons with a common interest can share
their comments/messages and have these comments classified automatically as legal or
illegal. This improves the quality of conversation among users. The hybrid
model was developed using WEKA tools and Java programming language. The result
shows that the hybrid model gave 96.76% accuracy as against the 61.45% and
69.21% of the Naïve Bayes and SVM models respectively.
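The abstract does not say how the Naïve Bayes and SVM components are combined, and the authors' implementation uses WEKA and Java. As a minimal sketch of one possible combination only, the Python below feeds Naïve Bayes log-count ratios into a linear SVM trained by subgradient descent. Everything in it is an assumption made for illustration: the toy comments, the +1 ("illegal") and -1 ("legal") label encoding, the hyperparameters, and all function names. It is not the paper's method and says nothing about the reported accuracies.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenizer; a real pipeline would add more text-mining steps.
    return text.lower().split()

def nb_log_ratios(docs, labels, alpha=1.0):
    """Naive Bayes step: smoothed log-count ratio per word (+1 class vs -1 class)."""
    pos, neg = Counter(), Counter()
    for doc, y in zip(docs, labels):
        (pos if y == 1 else neg).update(tokenize(doc))
    vocab = sorted(set(pos) | set(neg))
    pos_total = sum(pos.values()) + alpha * len(vocab)
    neg_total = sum(neg.values()) + alpha * len(vocab)
    return {
        word: math.log((pos[word] + alpha) / pos_total)
        - math.log((neg[word] + alpha) / neg_total)
        for word in vocab
    }

def featurize(doc, ratios):
    """Sparse features: the NB ratio of every known word present in the document."""
    return {word: ratios[word] for word in set(tokenize(doc)) if word in ratios}

def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """SVM step: subgradient descent on the L2-regularized hinge loss."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, yi in zip(X, y):
            margin = yi * (sum(w.get(k, 0.0) * v for k, v in x.items()) + b)
            for k in w:                      # weight decay from the regularizer
                w[k] *= 1.0 - lr * lam
            if margin < 1.0:                 # hinge loss is active for this sample
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + lr * yi * v
                b += lr * yi
    return w, b

def predict(doc, ratios, w, b):
    """Return +1 (illegal) or -1 (legal) for a new comment."""
    x = featurize(doc, ratios)
    score = sum(w.get(k, 0.0) * v for k, v in x.items()) + b
    return 1 if score > 0 else -1

# Hypothetical toy data: +1 = "illegal" comment, -1 = "legal" comment.
docs = [
    "you are a stupid idiot",
    "i will hurt you idiot",
    "thanks for the helpful post",
    "great post thanks for sharing",
]
labels = [1, 1, -1, -1]

ratios = nb_log_ratios(docs, labels)
w, b = train_linear_svm([featurize(d, ratios) for d in docs], labels)
print(predict("stupid idiot", ratios, w, b))
print(predict("helpful post thanks", ratios, w, b))
```

In this sketch the Naïve Bayes statistics act as a feature weighting and the SVM learns the decision boundary. Other hybrids, such as voting or stacking the two classifiers, are equally plausible readings of the abstract.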
Related papers
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
- A Novel Neural Network-Based Federated Learning System for Imbalanced and Non-IID Data [2.9642661320713555]
Most machine learning algorithms rely heavily on large amounts of data which may be collected from various sources.
To combat this issue, researchers have introduced federated learning, where a prediction model is learnt while ensuring the privacy of clients' data.
In this research, we propose a centralized, neural network-based federated learning system.
arXiv Detail & Related papers (2023-11-16T17:14:07Z)
- Training Deep Surrogate Models with Large Scale Online Learning [48.7576911714538]
Deep learning algorithms have emerged as a viable alternative for obtaining fast solutions for PDEs.
Models are usually trained on synthetic data generated by solvers, stored on disk and read back for training.
It proposes an open source online training framework for deep surrogate models.
arXiv Detail & Related papers (2023-06-28T12:02:27Z)
- Deep Sequence Models for Text Classification Tasks [0.007329200485567826]
Natural Language Processing (NLP) is equipping machines to understand diverse and complicated human languages.
Common text classification applications include information retrieval, news topic modeling, theme extraction, sentiment analysis, and spam detection.
Sequence models such as RNN, GRU, and LSTM are a breakthrough for tasks with long-range dependencies.
Results generated were excellent with most of the models performing within the range of 80% and 94%.
arXiv Detail & Related papers (2022-07-18T18:47:18Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Self-service Data Classification Using Interactive Visualization and Interpretable Machine Learning [9.13755431537592]
Iterative Visual Logical Classifier (IVLC) is an interpretable machine learning algorithm.
IVLC is especially helpful when dealing with sensitive and crucial data like cancer data in the medical domain.
This chapter proposes an automated classification approach combined with a new Coordinate Order (COO) algorithm and a genetic algorithm.
arXiv Detail & Related papers (2021-07-11T05:39:14Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data.
Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community.
Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
- It's the Best Only When It Fits You Most: Finding Related Models for Serving Based on Dynamic Locality Sensitive Hashing [1.581913948762905]
Preparation of training data is often a bottleneck in the lifecycle of deploying a deep learning model for production or research.
This paper proposes an end-to-end process of searching related models for serving based on the similarity of the target dataset and the training datasets of the available models.
arXiv Detail & Related papers (2020-10-13T22:52:13Z)
- ARDA: Automatic Relational Data Augmentation for Machine Learning [23.570173866941612]
We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set.
Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join.
arXiv Detail & Related papers (2020-03-21T21:55:22Z)