A pipeline and comparative study of 12 machine learning models for text classification
- URL: http://arxiv.org/abs/2204.06518v1
- Date: Mon, 4 Apr 2022 23:51:22 GMT
- Title: A pipeline and comparative study of 12 machine learning models for text classification
- Authors: Annalisa Occhipinti, Louis Rogers, Claudio Angione
- Abstract summary: Text-based communication is highly favoured as a communication method, especially in business environments.
Many machine learning methods for text classification have been proposed and incorporated into the services of most email providers.
However, optimising text classification algorithms and finding the right trade-off in their aggressiveness is still a major research problem.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-based communication is highly favoured as a communication method,
especially in business environments. As a result, it is often abused by sending
malicious messages, e.g., spam emails, to deceive users into relaying personal
information, including online account credentials or banking details. For this
reason, many machine learning methods for text classification have been
proposed and incorporated into the services of most email providers. However,
optimising text classification algorithms and finding the right trade-off in
their aggressiveness is still a major research problem.
We present an updated survey of 12 machine learning text classifiers applied
to a public spam corpus. A new pipeline is proposed to optimise hyperparameter
selection and improve the models' performance by applying specific methods
(based on natural language processing) in the preprocessing stage.
Our study aims to provide a new methodology to investigate and optimise the
effect of different feature sizes and hyperparameters in machine learning
classifiers that are widely used in text classification problems. The
classifiers are tested and evaluated on several metrics, including F-score,
precision, recall, and run time. By analysing all these aspects, we show how
the proposed pipeline can be used to achieve good accuracy in spam filtering
on the Enron dataset, a widely used public email corpus.
Statistical tests and explainability techniques are applied to provide a robust
analysis of the proposed pipeline and interpret the classification outcomes of
the 12 machine learning models, also identifying words that drive the
classification results. Our analysis shows that it is possible to identify an
effective machine learning model to classify the Enron dataset with an F-score
of 94%.
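
To make the abstract's pipeline concrete, here is a minimal sketch in scikit-learn: NLP-style preprocessing feeds a TF-IDF vectoriser, and a grid search tunes the feature-set size and model hyperparameters against the F-score. The grid values, the preprocessing steps, and the choice of classifier are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a preprocessing + hyperparameter-tuning pipeline
# (illustrative assumptions, not the authors' exact configuration).
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def preprocess(text: str) -> str:
    """Basic NLP-style cleaning: lowercase and strip non-letter characters."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

pipeline = Pipeline([
    # Feature extraction: TF-IDF over the preprocessed text.
    ("tfidf", TfidfVectorizer(preprocessor=preprocess, stop_words="english")),
    # One candidate model; the paper compares 12 classifiers in this slot.
    ("clf", MultinomialNB()),
])

# Search jointly over feature-set size and model hyperparameters,
# selecting by F-score as in the paper's evaluation.
param_grid = {
    "tfidf__max_features": [1000, 5000, 10000],
    "clf__alpha": [0.1, 0.5, 1.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
# search.fit(emails, labels)  # emails: list of message strings, labels: 1 = spam
```

In the paper's setting, the same search would be repeated for each of the 12 candidate classifiers and the results compared on precision, recall, F-score, and run time.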
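The abstract also mentions explainability, i.e. identifying the words that drive the classification. One simple way to do this, continuing the sketch above (and assuming `search` has been fitted; the paper's exact technique may differ), is to compare per-class word weights of the best model:

```python
# Continues the sketch above; assumes `search` has been fitted, and that
# X_test / y_test (assumed names) hold held-out messages and 0/1 spam labels.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_pred = search.predict(X_test)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
print(f"precision={precision:.3f} recall={recall:.3f} F-score={f1:.3f}")

# One simple explainability technique (not necessarily the paper's):
# for a fitted naive Bayes model, the difference in per-class log
# probabilities ranks the words most indicative of spam.
best = search.best_estimator_
words = best.named_steps["tfidf"].get_feature_names_out()
log_prob = best.named_steps["clf"].feature_log_prob_
spamminess = log_prob[1] - log_prob[0]  # log P(word|spam) - log P(word|ham)
top_spam_words = words[np.argsort(spamminess)[-10:]]
print("words driving the spam class:", list(top_spam_words))
```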
Related papers
- Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering [8.20929362102942]
Author profiling is the task of inferring characteristics about individuals by analyzing content they share.
We propose a new method for author profiling that first distinguishes relevant from irrelevant content, and then performs the actual user profiling using only the relevant data.
We evaluate our method for Big Five personality trait prediction on two Twitter corpora.
arXiv Detail & Related papers (2024-09-06T08:43:10Z) - M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the words forming the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z) - Like a Good Nearest Neighbor: Practical Content Moderation and Text
Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z) - LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of
Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z) - Selective Annotation Makes Language Models Better Few-Shot Learners [97.07544941620367]
Large language models can perform in-context learning, where they learn a new task from a few task demonstrations.
This work examines the implications of in-context learning for the creation of datasets for new natural language tasks.
We propose an unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate.
arXiv Detail & Related papers (2022-09-05T14:01:15Z) - Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - Dominant Set-based Active Learning for Text Classification and its
Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on large unlabeled corpora with minimal annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves a higher performance in comparison to the state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z) - Low-rank Dictionary Learning for Unsupervised Feature Selection [11.634317251468968]
We introduce a novel unsupervised feature selection approach by applying dictionary learning ideas in a low-rank representation.
A unified objective function for unsupervised feature selection is proposed, with sparsity enforced by $\ell_{2,1}$-norm regularization.
Our experimental findings reveal that the proposed method outperforms the state-of-the-art algorithm.
arXiv Detail & Related papers (2021-06-21T13:39:10Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model under test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Data Augmentation in Natural Language Processing: A Novel Text
Generation Approach for Long and Short Text Classifiers [8.19984844136462]
We present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts.
In a simulated low-data regime, additive accuracy gains of up to 15.53% are achieved.
We discuss implications and patterns for the successful application of our approach on different types of datasets.
arXiv Detail & Related papers (2021-03-26T13:16:07Z) - Does a Hybrid Neural Network based Feature Selection Model Improve Text
Classification? [9.23545668304066]
We propose a hybrid feature selection method for obtaining relevant features.
We then present three ways of implementing a feature selection and neural network pipeline.
We also observed a slight increase in accuracy on some datasets.
arXiv Detail & Related papers (2021-01-22T09:12:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.