A pipeline and comparative study of 12 machine learning models for text classification
- URL: http://arxiv.org/abs/2204.06518v1
- Date: Mon, 4 Apr 2022 23:51:22 GMT
- Title: A pipeline and comparative study of 12 machine learning models for text classification
- Authors: Annalisa Occhipinti, Louis Rogers, Claudio Angione
- Abstract summary: Text-based communication is highly favoured as a communication method, especially in business environments.
Many machine learning methods for text classification have been proposed and incorporated into the services of most email providers.
However, optimising text classification algorithms and finding the right trade-off in their aggressiveness is still a major research problem.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-based communication is highly favoured as a communication method,
especially in business environments. As a result, it is often abused by sending
malicious messages, e.g., spam emails, to deceive users into relaying personal
information, including online account credentials or banking details. For this
reason, many machine learning methods for text classification have been
proposed and incorporated into the services of most email providers. However,
optimising text classification algorithms and finding the right trade-off in
their aggressiveness is still a major research problem.
We present an updated survey of 12 machine learning text classifiers applied
to a public spam corpus. A new pipeline is proposed to optimise hyperparameter
selection and improve the models' performance by applying specific methods
(based on natural language processing) in the preprocessing stage.
Our study aims to provide a new methodology to investigate and optimise the
effect of different feature sizes and hyperparameters in machine learning
classifiers that are widely used in text classification problems. The
classifiers are tested and evaluated on several metrics, including F-score,
precision, recall, and run time. By analysing all these aspects, we show how
the proposed pipeline can be used to achieve good accuracy in spam filtering
on the Enron dataset, a widely used public email corpus.
Statistical tests and explainability techniques are applied to provide a robust
analysis of the proposed pipeline and interpret the classification outcomes of
the 12 machine learning models, also identifying words that drive the
classification results. Our analysis shows that it is possible to identify an
effective machine learning model to classify the Enron dataset with an F-score
of 94%.
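
To make the abstract's pipeline concrete, here is a minimal sketch in scikit-learn: NLP-style preprocessing feeds a TF-IDF vectoriser, and a grid search tunes the feature-set size and model hyperparameters against the F-score. The grid values, the preprocessing steps, and the choice of classifier are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a preprocessing + hyperparameter-tuning pipeline
# (illustrative assumptions, not the authors' exact configuration).
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def preprocess(text: str) -> str:
    """Basic NLP-style cleaning: lowercase and strip non-letter characters."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

pipeline = Pipeline([
    # Feature extraction: TF-IDF over the preprocessed text.
    ("tfidf", TfidfVectorizer(preprocessor=preprocess, stop_words="english")),
    # One candidate model; the paper compares 12 classifiers in this slot.
    ("clf", MultinomialNB()),
])

# Search jointly over feature-set size and model hyperparameters,
# selecting by F-score as in the paper's evaluation.
param_grid = {
    "tfidf__max_features": [1000, 5000, 10000],
    "clf__alpha": [0.1, 0.5, 1.0],
}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
# search.fit(emails, labels)  # emails: list of message strings, labels: 1 = spam
```

In the paper's setting, the same search would be repeated for each of the 12 candidate classifiers and the results compared on precision, recall, F-score, and run time.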
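The abstract also mentions explainability, i.e. identifying the words that drive the classification. One simple way to do this, continuing the sketch above (and assuming `search` has been fitted; the paper's exact technique may differ), is to compare per-class word weights of the best model:

```python
# Continues the sketch above; assumes `search` has been fitted, and that
# X_test / y_test (assumed names) hold held-out messages and 0/1 spam labels.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_pred = search.predict(X_test)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
print(f"precision={precision:.3f} recall={recall:.3f} F-score={f1:.3f}")

# One simple explainability technique (not necessarily the paper's):
# for a fitted naive Bayes model, the difference in per-class log
# probabilities ranks the words most indicative of spam.
best = search.best_estimator_
words = best.named_steps["tfidf"].get_feature_names_out()
log_prob = best.named_steps["clf"].feature_log_prob_
spamminess = log_prob[1] - log_prob[0]  # log P(word|spam) - log P(word|ham)
top_spam_words = words[np.argsort(spamminess)[-10:]]
print("words driving the spam class:", list(top_spam_words))
```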
Related papers
- Prompt-based Personality Profiling: Reinforcement Learning for Relevance Filtering [8.20929362102942]
Author profiling is the task of inferring characteristics about individuals by analyzing content they share.
We propose a new method for author profiling that first distinguishes relevant from irrelevant content, and then performs the actual user profiling using only the relevant data.
We evaluate our method for Big Five personality trait prediction on two Twitter corpora.
arXiv Detail & Related papers (2024-09-06T08:43:10Z) - M-Tuning: Prompt Tuning with Mitigated Label Bias in Open-Set Scenarios [103.6153593636399]
We propose a vision-language prompt tuning method with mitigated label bias (M-Tuning).
It introduces open words from WordNet to extend the words forming the prompt texts beyond the closed-set label words, so that prompts are tuned in a simulated open-set scenario.
Our method achieves the best performance on datasets with various scales, and extensive ablation studies also validate its effectiveness.
arXiv Detail & Related papers (2023-03-09T09:05:47Z) - Like a Good Nearest Neighbor: Practical Content Moderation and Text
Classification [66.02091763340094]
Like a Good Nearest Neighbor (LaGoNN) is a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor.
LaGoNN is effective at flagging undesirable content and text classification, and improves the performance of SetFit.
arXiv Detail & Related papers (2023-02-17T15:43:29Z) - LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of
Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting.
LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available.
LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z) - Selective Annotation Makes Language Models Better Few-Shot Learners [97.07544941620367]
Large language models can perform in-context learning, where they learn a new task from a few task demonstrations.
This work examines the implications of in-context learning for the creation of datasets for new natural language tasks.
We propose an unsupervised, graph-based selective annotation method, vote-k, to select diverse, representative examples to annotate.
arXiv Detail & Related papers (2022-09-05T14:01:15Z) - Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - Dominant Set-based Active Learning for Text Classification and its
Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on large unlabeled corpora with minimal annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves a higher performance in comparison to the state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z) - Low-rank Dictionary Learning for Unsupervised Feature Selection [11.634317251468968]
We introduce a novel unsupervised feature selection approach by applying dictionary learning ideas in a low-rank representation.
A unified objective function for unsupervised feature selection is proposed, with sparsity enforced by $\ell_{2,1}$-norm regularization.
Our experimental findings reveal that the proposed method outperforms the state-of-the-art algorithm.
arXiv Detail & Related papers (2021-06-21T13:39:10Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model under test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Data Augmentation in Natural Language Processing: A Novel Text
Generation Approach for Long and Short Text Classifiers [8.19984844136462]
We present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts.
In a simulated low-data regime, additive accuracy gains of up to 15.53% are achieved.
We discuss implications and patterns for the successful application of our approach on different types of datasets.
arXiv Detail & Related papers (2021-03-26T13:16:07Z) - Does a Hybrid Neural Network based Feature Selection Model Improve Text
Classification? [9.23545668304066]
We propose a hybrid feature selection method for obtaining relevant features.
We then present three ways of implementing a feature selection and neural network pipeline.
We also observed a slight increase in accuracy on some datasets.
arXiv Detail & Related papers (2021-01-22T09:12:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.