Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies
- URL: http://arxiv.org/abs/2405.07440v2
- Date: Thu, 05 Dec 2024 03:26:16 GMT
- Title: Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies
- Authors: Mu-Huan Miles Chung, Sharon Li, Jaturong Kongmanee, Lu Wang, Yuhong Yang, Calvin Giang, Khilan Jerath, Abhay Raman, David Lie, Mark Chignell,
- Abstract summary: We develop an enhanced method of Active Learning using an information gain maximizing data.
We evaluate its effectiveness in a real world setting where only redacted versions of email could be labeled by human analysts.
- Score: 7.770699559625337
- License:
- Abstract: Redacted emails satisfy most privacy requirements but they make it more difficult to detect anomalous emails that may be indicative of data exfiltration. In this paper we develop an enhanced method of Active Learning using an information gain maximizing heuristic, and we evaluate its effectiveness in a real world setting where only redacted versions of email could be labeled by human analysts due to privacy concerns. In the first case study we examined how Active Learning should be carried out. We found that model performance was best when a single highly skilled (in terms of the labelling task) analyst provided the labels. In the second case study we used confidence ratings to estimate the labeling uncertainty of analysts and then prioritized instances for labeling based on the expected information gain (the difference between model uncertainty and analyst uncertainty) that would be provided by labelling each instance. We found that the information maximization gain heuristic improved model performance over existing sampling methods for Active Learning. Based on the results obtained, we recommend that analysts should be screened, and possibly trained, prior to implementation of Active Learning in cybersecurity applications. We also recommend that the information gain maximizing sample method (based on expert confidence) should be used in early stages of Active Learning, providing that well-calibrated confidence can be obtained. We also note that the expertise of analysts should be assessed prior to Active Learning, as we found that analysts with lower labelling skill had poorly calibrated (over-) confidence in their labels.
Related papers
- KBAlign: Efficient Self Adaptation on Specific Knowledge Bases [73.34893326181046]
Large language models (LLMs) usually rely on retrieval-augmented generation to exploit knowledge materials in an instant manner.
We propose KBAlign, an approach designed for efficient adaptation to downstream tasks involving knowledge bases.
Our method utilizes iterative training with self-annotated data such as Q&A pairs and revision suggestions, enabling the model to grasp the knowledge content efficiently.
arXiv Detail & Related papers (2024-11-22T08:21:03Z) - Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models [37.172662930947446]
Language models (LMs) are potentially vulnerable to extraction attacks, which represent a significant privacy risk.
We propose Privacy Protection via Optimal Parameters (POP), a novel unlearning method that effectively forgets the target token sequences from the pretrained LM.
POP exhibits remarkable retention performance post-unlearning across 9 classification and 4 dialogue benchmarks, outperforming the state-of-the-art by a large margin.
arXiv Detail & Related papers (2024-06-20T08:12:49Z) - Uncertainty for Active Learning on Graphs [70.44714133412592]
Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models.
We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap to other Active Learning strategies.
We develop ground-truth Bayesian uncertainty estimates in terms of the data generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries.
arXiv Detail & Related papers (2024-05-02T16:50:47Z) - XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z) - Data AUDIT: Identifying Attribute Utility- and Detectability-Induced
Bias in Task Models [8.420252576694583]
We present a first technique for the rigorous, quantitative screening of medical image datasets.
Our method decomposes the risks associated with dataset attributes in terms of their detectability and utility.
Using our method, we show our screening method reliably identifies nearly imperceptible bias-inducing artifacts.
arXiv Detail & Related papers (2023-04-06T16:50:15Z) - Implementing Active Learning in Cybersecurity: Detecting Anomalies in
Redacted Emails [10.303697869042283]
We present research results concerning the application of Active Learning to anomaly detection in redacted emails.
We evaluate different AL strategies and their impact on resulting model performance.
arXiv Detail & Related papers (2023-03-01T23:53:01Z) - Responsible Active Learning via Human-in-the-loop Peer Study [88.01358655203441]
We propose a responsible active learning method, namely Peer Study Learning (PSL), to simultaneously preserve data privacy and improve model stability.
We first introduce a human-in-the-loop teacher-student architecture to isolate unlabelled data from the task learner (teacher) on the cloud-side.
During training, the task learner instructs the light-weight active learner which then provides feedback on the active sampling criterion.
arXiv Detail & Related papers (2022-11-24T13:18:27Z) - Knowledge-driven Active Learning [70.37119719069499]
Active learning strategies aim at minimizing the amount of labelled data required to train a Deep Learning model.
Most active strategies are based on uncertain sample selection, and even often restricted to samples lying close to the decision boundary.
Here we propose to take into consideration common domain-knowledge and enable non-expert users to train a model with fewer samples.
arXiv Detail & Related papers (2021-10-15T06:11:53Z) - Investigating a Baseline Of Self Supervised Learning Towards Reducing
Labeling Costs For Image Classification [0.0]
The study implements the kaggle.com' cats-vs-dogs dataset, Mnist and Fashion-Mnist to investigate the self-supervised learning task.
Results show that the pretext process in the self-supervised learning improves the accuracy around 15% in the downstream classification task.
arXiv Detail & Related papers (2021-08-17T06:43:05Z) - Low-Regret Active learning [64.36270166907788]
We develop an online learning algorithm for identifying unlabeled data points that are most informative for training.
At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on predictable (easy) instances.
arXiv Detail & Related papers (2021-04-06T22:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.