Text generation for dataset augmentation in security classification tasks
- URL: http://arxiv.org/abs/2310.14429v1
- Date: Sun, 22 Oct 2023 22:25:14 GMT
- Title: Text generation for dataset augmentation in security classification tasks
- Authors: Alexander P. Welsh and Matthew Edwards
- Abstract summary: This study evaluates the application of natural language text generators to fill this data gap in multiple security-related text classification tasks.
We find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Security classifiers, designed to detect malicious content in computer
systems and communications, can underperform when provided with insufficient
training data. In the security domain, it is often easy to find samples of the
negative (benign) class, and challenging to find enough samples of the positive
(malicious) class to train an effective classifier. This study evaluates the
application of natural language text generators to fill this data gap in
multiple security-related text classification tasks. We describe a variety of
previously unexamined language-model fine-tuning approaches for this purpose
and consider in particular the impact of disproportionate class imbalances in
the training set. Across our evaluation using three state-of-the-art
classifiers designed for offensive language detection, review fraud detection,
and SMS spam detection, we find that models trained with GPT-3 data
augmentation strategies outperform both models trained without augmentation and
models trained using basic data augmentation strategies already in common
usage. In particular, we find substantial benefits for GPT-3 data augmentation
strategies in situations with severe limitations on known positive-class
samples.
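To make the pipeline shape concrete, here is a minimal sketch (our illustration, not the authors' code): generate_positive_sample is a hypothetical stand-in for a fine-tuned GPT-3-style generator, word_dropout stands in for the basic augmentation strategies the paper compares against, and scikit-learn supplies a baseline classifier; all three choices are assumptions made for illustration.

    # Illustrative sketch of minority-class augmentation for a security
    # text classifier; not the paper's implementation. scikit-learn and
    # the helper names below are assumptions for illustration.
    import random

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def word_dropout(text, p=0.1, rng=random.Random(0)):
        """A 'basic' augmentation baseline: randomly drop tokens."""
        kept = [w for w in text.split() if rng.random() > p]
        return " ".join(kept) if kept else text

    def generate_positive_sample(seed_texts):
        """Hypothetical wrapper around a text generator fine-tuned on the
        known positive (malicious) samples; plug a real model in here."""
        raise NotImplementedError

    def augment_and_train(texts, labels, target_ratio=0.5, use_generator=False):
        """Add synthetic positives until they make up `target_ratio` of the
        final training set, then fit a simple TF-IDF baseline classifier."""
        positives = [t for t, y in zip(texts, labels) if y == 1]
        # Solve (p + s) / (n + s) = target_ratio for s, the count to add.
        s = int((target_ratio * len(texts) - len(positives)) / (1 - target_ratio))
        if use_generator:
            synthetic = [generate_positive_sample(positives) for _ in range(max(0, s))]
        else:
            synthetic = [word_dropout(random.choice(positives)) for _ in range(max(0, s))]
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(texts + synthetic, labels + [1] * len(synthetic))
        return clf

The abstract's headline result maps onto the use_generator=True branch, which helps most when the set of known positives is very small; the dropout-style branch represents the kind of basic strategy already in common usage that the paper reports outperforming.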
Related papers
- Selecting Between BERT and GPT for Text Classification in Political Science Research (arXiv, 2024-11-07)
We evaluate the effectiveness of BERT-based versus GPT-based models in low-data scenarios.
We conclude by comparing these approaches in terms of performance, ease of use, and cost.
- Generalization Properties of Retrieval-based Models (arXiv, 2022-10-06)
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
- Multi-Level Fine-Tuning, Data Augmentation, and Few-Shot Learning for Specialized Cyber Threat Intelligence (arXiv, 2022-07-22)
We propose a system that trains a new classifier for each new incident.
With standard training methods, this would require a large amount of labelled data.
We evaluate our approach using a novel dataset derived from the Microsoft Exchange Server data breach of 2021.
- Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification (arXiv, 2021-11-17)
We leverage GPT-2 for generating artificial training instances in order to improve classification performance.
Our results show that fine-tuning GPT-2 on a handful of labelled instances leads to consistent classification improvements.
- Few-Shot Text Classification with Triplet Networks, Data Augmentation, and Curriculum Learning (arXiv, 2021-03-12)
Few-shot text classification is a fundamental NLP task in which a model aims to classify text into a large number of categories given only a few labelled examples.
This paper explores data augmentation -- a technique particularly suitable for training with limited data.
We find that common data augmentation techniques can improve the performance of triplet networks by up to 3.0% on average.
- Improving speech recognition models with small samples for air traffic control systems (arXiv, 2021-02-16)
In this work, a novel training approach based on pretraining and transfer learning is proposed to address the issue of small training samples.
Three real ATC datasets are used to validate the proposed ASR model and training strategies.
The experimental results demonstrate that the ASR performance is significantly improved on all three datasets.
- Few-Shot Named Entity Recognition: A Comprehensive Study (arXiv, 2020-12-29)
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks (arXiv, 2020-11-03)
We propose a novel augmentation method with language models trained on linearized labeled sentences (a toy linearization is sketched after this list).
Our method is applicable to both supervised and semi-supervised settings.
- Revisiting LSTM Networks for Semi-Supervised Text Classification via Mixed Objective Function (arXiv, 2020-09-08)
We develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results.
We report state-of-the-art results for the text classification task on several benchmark datasets.
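On the DAGA entry above: "linearized labeled sentences" can be read as interleaving tag tokens with words so that an ordinary language model can be trained on, and later sample, tagged sentences. The toy function below shows one plausible linearization; it is a reading of the abstract, not necessarily the paper's exact format.

    def linearize(tokens, tags):
        """Flatten a (token, BIO-tag) sequence into one string by emitting
        each non-O tag as a pseudo-token before its word, so a plain LM can
        model words and labels jointly. Toy scheme for illustration."""
        out = []
        for token, tag in zip(tokens, tags):
            if tag != "O":
                out.append(tag)
            out.append(token)
        return " ".join(out)

    # "John lives in New York" with person/location tags:
    print(linearize(["John", "lives", "in", "New", "York"],
                    ["B-PER", "O", "O", "B-LOC", "I-LOC"]))
    # -> B-PER John lives in B-LOC New I-LOC York

Sampling from a language model trained on such strings and then inverting the mapping yields new synthetic tagged sentences, which is how generation-based augmentation can serve low-resource tagging.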