Text Data Augmentation: Towards better detection of spear-phishing emails
- URL: http://arxiv.org/abs/2007.02033v2
- Date: Thu, 25 Mar 2021 14:54:10 GMT
- Title: Text Data Augmentation: Towards better detection of spear-phishing emails
- Authors: Mehdi Regina and Maxime Meyer and Sébastien Goutal
- Abstract summary: We propose a corpus- and task-agnostic augmentation framework to augment English texts within our company.
Our proposal combines different methods, utilizing the BERT language model, multi-step back-translation, and heuristics.
We show that our augmentation framework improves performance on several text classification tasks using publicly available models and corpora.
- Score: 1.6556358263455926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text data augmentation, i.e., the creation of new textual data from an
existing text, is challenging. Indeed, augmentation transformations should take
into account language complexity while being relevant to the target Natural
Language Processing (NLP) task (e.g., Machine Translation, Text
Classification). Initially motivated by an application of Business Email
Compromise (BEC) detection, we propose a corpus and task agnostic augmentation
framework used as a service to augment English texts within our company. Our
proposal combines different methods, utilizing BERT language model, multi-step
back-translation and heuristics. We show that our augmentation framework
improves performance on several text classification tasks using publicly
available models and corpora, as well as on a BEC detection task. We also
provide a comprehensive argumentation about the limitations of our augmentation
framework.
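The paper itself does not ship code here; the sketch below is only a rough illustration of two of the building blocks named in the abstract, BERT masked-token substitution and back-translation, using the Hugging Face transformers library. The model checkpoints, the French pivot language, and the sampling choices are assumptions made for illustration, not the authors' configuration.

```python
# Illustrative sketch (not the paper's implementation) of two augmentation blocks:
# (1) BERT masked-token substitution, (2) back-translation via a pivot language.
# Checkpoints and parameters below are assumed choices for demonstration only.
import random
from transformers import pipeline

# (1) Replace one random word with a BERT prediction for the masked position.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def bert_substitute(text: str) -> str:
    words = text.split()
    i = random.randrange(len(words))
    original = words[i]
    words[i] = fill_mask.tokenizer.mask_token
    candidates = fill_mask(" ".join(words), top_k=5)
    # Keep the best prediction that differs from the original word.
    for c in candidates:
        if c["token_str"].strip().lower() != original.lower():
            words[i] = c["token_str"].strip()
            break
    else:
        words[i] = original
    return " ".join(words)

# (2) Back-translation: translate to a pivot language and back to obtain a paraphrase.
en_fr = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation_fr_to_en", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    french = en_fr(text)[0]["translation_text"]
    return fr_en(french)[0]["translation_text"]

if __name__ == "__main__":
    sentence = "Please review the attached invoice and confirm the wire transfer today."
    print(bert_substitute(sentence))
    print(back_translate(sentence))
```

In practice, multi-step back-translation chains several pivot languages and heuristics filter out augmentations that drift too far from the original meaning; those details are not reproduced in this sketch.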
Related papers
- Topic-to-essay generation with knowledge-based content selection [1.0625748132006634]
We propose a novel copy mechanism model with a content selection module that integrates rich semantic knowledge from the language model into the decoder.
Experimental results demonstrate that the proposed model can improve the generated text diversity by 35% to 59% compared to the state-of-the-art method.
arXiv Detail & Related papers (2024-02-26T02:14:42Z)
- Adapting Large Language Models to Domains via Reading Comprehension [86.24451681746676]
We explore how continued pre-training on domain-specific corpora influences large language models.
We show that training on the raw corpora endows the model with domain knowledge, but drastically hurts its ability for question answering.
We propose a simple method for transforming raw corpora into reading comprehension texts.
arXiv Detail & Related papers (2023-09-18T07:17:52Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
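As a rough, unofficial illustration of the rephrasing idea summarized in this entry, the snippet below uses the OpenAI Python client to expand one training sentence into several paraphrases; the prompt wording, model name, and variant count are assumptions, not AugGPT's published settings.

```python
# Sketch of ChatGPT-based rephrasing augmentation in the spirit of AugGPT.
# Prompt, model name, and sample count are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase(sentence: str, n_variants: int = 3) -> list[str]:
    prompt = (
        f"Rephrase the following sentence in {n_variants} different ways, "
        f"keeping its meaning. Return one rephrasing per line.\n\n{sentence}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [line.strip() for line in lines if line.strip()][:n_variants]

# Each labeled training sentence is expanded into paraphrases that keep its label.
augmented = rephrase("Your account will be suspended unless you verify your details.")
```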
arXiv Detail & Related papers (2023-02-25T06:58:16Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP [0.0]
We investigate three categories of text augmentation methodologies which perform changes on the syntax.
We compare them on part-of-speech tagging, dependency parsing and semantic role labeling for a diverse set of language families.
Our results suggest that the augmentation techniques can further improve over strong baselines based on mBERT.
arXiv Detail & Related papers (2021-11-18T10:52:48Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [64.22926988297685]
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format.
arXiv Detail & Related papers (2019-10-23T17:37:36Z)
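To make the text-to-text framing of the last entry concrete, here is a small sketch using the publicly released T5 checkpoints in Hugging Face transformers. The task prefixes follow the conventions reported for T5; the checkpoint choice and example sentences are illustrative assumptions.

```python
# Every task is cast as "input text -> output text": translation, classification,
# and summarization all use the same model interface, differing only in the prefix.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    # Translation: the target is the translated sentence.
    "translate English to German: The house is wonderful.",
    # Sentiment classification: the target is a label word such as "positive" or "negative".
    "sst2 sentence: this film is full of charm and wit.",
    # Summarization: the target is the summary text.
    "summarize: Transfer learning pre-trains a model on a data-rich task before fine-tuning it.",
]

for text in examples:
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```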