Machine Learning for Detecting Data Exfiltration: A Review
- URL: http://arxiv.org/abs/2012.09344v2
- Date: Sun, 21 Mar 2021 23:23:14 GMT
- Title: Machine Learning for Detecting Data Exfiltration: A Review
- Authors: Bushra Sabir, Faheem Ullah, M. Ali Babar and Raj Gaire
- Abstract summary: Research at the intersection of cybersecurity, Machine Learning (ML), and Software Engineering (SE) has recently taken significant steps in proposing countermeasures for detecting sophisticated data exfiltration attacks.
This paper aims at systematically reviewing ML-based data exfiltration countermeasures to identify and classify ML approaches, feature engineering techniques, evaluation datasets, and performance metrics used for these countermeasures.
- Score: 1.949912057689623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Context: Research at the intersection of cybersecurity, Machine Learning
(ML), and Software Engineering (SE) has recently taken significant steps in
proposing countermeasures for detecting sophisticated data exfiltration
attacks. It is important to systematically review and synthesize the ML-based
data exfiltration countermeasures for building a body of knowledge on this
important topic. Objective: This paper aims at systematically reviewing
ML-based data exfiltration countermeasures to identify and classify ML
approaches, feature engineering techniques, evaluation datasets, and
performance metrics used for these countermeasures. This review also aims at
identifying gaps in research on ML-based data exfiltration countermeasures.
Method: We used a Systematic Literature Review (SLR) method to select and
review 92 papers. Results: The review has enabled us to (a) classify the ML
approaches used in the countermeasures into data-driven and behaviour-driven
approaches, (b) categorize features into six types: behavioural, content-based,
statistical, syntactical, spatial and temporal, (c) classify the evaluation
datasets into simulated, synthesized, and real datasets and (d) identify 11
performance measures used by these studies. Conclusion: We conclude that: (i)
the integration of data-driven and behaviour-driven approaches should be
explored; (ii) there is a need to develop high-quality, large-scale
evaluation datasets; (iii) incremental ML model training should be incorporated
in countermeasures; (iv) resilience to adversarial learning should be
considered and explored during the development of countermeasures to avoid
poisoning attacks; and (v) the use of automated feature engineering should be
encouraged for efficiently detecting data exfiltration attacks.
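As an illustration of conclusion (iii), a minimal sketch of incremental training using scikit-learn's `partial_fit` interface; the toy traffic-feature stream is an assumed placeholder, not a dataset from the reviewed studies:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical stream of labelled traffic-feature batches (e.g. packet sizes,
# inter-arrival times); any featurised exfiltration dataset would do.
def traffic_batches(n_batches=10, batch_size=256, n_features=8, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # toy label rule
        yield X, y

clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # must be declared up front for partial_fit

for X, y in traffic_batches():
    # Each call updates the model in place instead of retraining from scratch,
    # letting the detector adapt as new traffic arrives.
    clf.partial_fit(X, y, classes=classes)
```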
Related papers
- Impact of Missing Values in Machine Learning: A Comprehensive Analysis [0.0]
This paper aims to examine the nuanced impact of missing values on machine learning (ML) models.
Our analysis focuses on the challenges posed by missing values, including biased inferences, reduced predictive power, and increased computational burdens.
The study employs case studies and real-world examples to illustrate the practical implications of addressing missing values.
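A minimal sketch of one way such an analysis can be run, comparing mean imputation against complete-case training; the synthetic data and the choice of imputer are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.2] = np.nan  # knock out 20% of values

X_tr, X_te, y_tr, y_te = train_test_split(X_missing, y, random_state=0)

# Strategy 1: mean imputation, keep every row.
imp = SimpleImputer(strategy="mean")
model = LogisticRegression(max_iter=1000).fit(imp.fit_transform(X_tr), y_tr)
acc_imputed = accuracy_score(y_te, model.predict(imp.transform(X_te)))

# Strategy 2: complete-case analysis, drop rows with any missing value.
mask = ~np.isnan(X_tr).any(axis=1)
model_cc = LogisticRegression(max_iter=1000).fit(X_tr[mask], y_tr[mask])
# Test rows still need imputation so the model can score them at all.
acc_cc = accuracy_score(y_te, model_cc.predict(imp.transform(X_te)))

print(f"imputed: {acc_imputed:.3f}  complete-case: {acc_cc:.3f}")
```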
arXiv Detail & Related papers (2024-10-10T18:31:44Z)
- Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
Membership inference attacks (MIAs) aim to determine whether a specific instance was part of a target model's training data.
Applying MIAs to large language models (LLMs) presents unique challenges due to the massive scale of pre-training data and the ambiguous nature of membership.
We introduce EM-MIA, a novel MIA method for LLMs that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm.
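EM-MIA's exact update rules are not reproduced here; the toy sketch below only conveys the alternating, EM-style refinement shape, with made-up score updates over simulated per-example losses:

```python
import numpy as np

# Toy per-example losses from a target model (lower loss suggests membership).
rng = np.random.default_rng(1)
losses = np.concatenate([rng.normal(1.0, 0.3, 500),   # members (unknown to us)
                         rng.normal(2.0, 0.3, 500)])  # non-members

# Crude initial membership scores to break symmetry.
membership = 1 / (1 + np.exp(losses - losses.mean()))

for _ in range(20):
    # Calibration centres estimated from current beliefs; a crude stand-in
    # for the paper's prefix scores.
    mu_in = np.average(losses, weights=membership)
    mu_out = np.average(losses, weights=1 - membership)
    # E-step-like update: re-score each example against the two centres.
    d_in, d_out = (losses - mu_in) ** 2, (losses - mu_out) ** 2
    membership = np.exp(-d_in) / (np.exp(-d_in) + np.exp(-d_out))

print("mean score, true members:", membership[:500].mean())
print("mean score, non-members: ", membership[500:].mean())
```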
arXiv Detail & Related papers (2024-10-10T03:31:16Z)
- Probing Language Models for Pre-training Data Detection [11.37731401086372]
We propose to utilize the probing technique for pre-training data detection by examining the model's internal activations.
Our method is simple and effective and leads to more trustworthy pre-training data detection.
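A sketch of the general probing recipe (fit a lightweight classifier on a model's hidden activations to separate seen from unseen text); the GPT-2 model, layer choice, and mean pooling are illustrative assumptions, not the paper's configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Illustrative model choice; the paper's target models may differ.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def activation(text, layer=-1):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = lm(**ids).hidden_states[layer]  # (1, seq_len, hidden)
    return hs.mean(dim=1).squeeze(0).numpy()  # mean-pool over tokens

# Hypothetical labelled examples: 1 = seen in pre-training, 0 = unseen.
texts = ["example sentence believed to be in the corpus",
         "freshly written sentence the model never saw"]
labels = [1, 0]

X = [activation(t) for t in texts]
probe = LogisticRegression(max_iter=1000).fit(X, labels)  # the probing classifier
```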
arXiv Detail & Related papers (2024-06-03T13:58:04Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- Towards Better Modeling with Missing Data: A Contrastive Learning-based Visual Analytics Perspective [7.577040836988683]
Missing data can pose a challenge for machine learning (ML) modeling.
Current approaches are categorized into feature imputation and label prediction.
This study proposes a Contrastive Learning framework to model observed data with missing values.
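A toy sketch of the underlying idea (two randomly masked views of the same record should embed close together under an InfoNCE-style loss); the encoder, masking scheme, and temperature are assumptions, not the paper's framework:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_view(x, p=0.3):
    # Randomly zero out entries, mimicking an extra missingness pattern.
    return x * (torch.rand_like(x) > p)

encoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

x = torch.randn(128, 16)            # toy observed data (NaNs already zero-filled)
for _ in range(100):
    z1 = F.normalize(encoder(masked_view(x)), dim=1)
    z2 = F.normalize(encoder(masked_view(x)), dim=1)
    logits = z1 @ z2.T / 0.1         # cosine similarities / temperature
    target = torch.arange(len(x))    # positive pair = same record's two views
    loss = F.cross_entropy(logits, target)
    opt.zero_grad(); loss.backward(); opt.step()
```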
arXiv Detail & Related papers (2023-09-18T13:16:24Z)
- Mitigating ML Model Decay in Continuous Integration with Data Drift Detection: An Empirical Study [7.394099294390271]
This study aims to investigate the performance of using data drift detection techniques for automatically detecting the retraining points for ML models for TCP in CI environments.
We employed the Hellinger distance to identify changes in both the values and distribution of input data and leveraged these changes as retraining points for the ML model.
Our experimental evaluation of the Hellinger distance-based method demonstrated its efficacy and efficiency in detecting retraining points and reducing the associated costs.
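The Hellinger distance between two histograms p and q is H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2); a minimal sketch of using it as a retraining trigger, with the bin count and threshold as assumed parameters:

```python
import numpy as np

def hellinger(p, q):
    # p, q are normalised histograms over the same bins.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def drifted(reference, window, bins=20, threshold=0.2):
    # Bin both samples on the reference's range, then compare distributions.
    edges = np.histogram_bin_edges(reference, bins=bins)
    p, _ = np.histogram(reference, bins=edges)
    q, _ = np.histogram(window, bins=edges)
    p = p / p.sum()
    q = q / max(q.sum(), 1)
    return hellinger(p, q) > threshold  # True => retraining point

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 5000)
print(drifted(ref, rng.normal(0, 1, 1000)))   # False: same distribution
print(drifted(ref, rng.normal(1.5, 1, 1000))) # True: input drift detected
```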
arXiv Detail & Related papers (2023-05-22T05:55:23Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
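The paper's optimality-guaranteed algorithms are not reproduced here; as a loose stand-in, the sketch below clusters the states of a toy two-block chain by their empirical transition profiles:

```python
import numpy as np
from sklearn.cluster import KMeans

def transition_counts(seq, n_states):
    C = np.zeros((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        C[a, b] += 1
    return C

# Toy sequence over 6 states with a 2-block structure.
rng = np.random.default_rng(0)
block = lambda s: 0 if s < 3 else 1
seq = [0]
for _ in range(20000):
    # Stay in the current block w.p. 0.9, otherwise jump to the other block.
    target = block(seq[-1]) if rng.random() < 0.9 else 1 - block(seq[-1])
    seq.append(rng.integers(0, 3) + 3 * target)

C = transition_counts(seq, 6)
P = C / C.sum(axis=1, keepdims=True)        # empirical transition matrix
labels = KMeans(n_clusters=2, n_init=10).fit_predict(P)
print(labels)  # states 0-2 and 3-5 should land in different clusters
```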
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- Robustness Evaluation of Deep Unsupervised Learning Algorithms for Intrusion Detection Systems [0.0]
This paper evaluates the robustness of six recent deep learning algorithms for intrusion detection on contaminated data.
Our experiments suggest that the state-of-the-art algorithms used in this study are sensitive to data contamination and reveal the importance of self-defense against data perturbation.
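As a stand-in for the paper's six deep algorithms, a sketch of the same contamination-robustness protocol using a classical detector (IsolationForest); the injected contamination, not the detector, is the point being illustrated:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, (2000, 10))        # benign feature vectors (toy)
attack_pool = rng.normal(4, 1, (200, 10))    # attacks used only as contamination
attack_test = rng.normal(4, 1, (200, 10))    # held-out attacks for scoring

def auc_with_contamination(ratio):
    # Inject unlabelled attack points into the nominally clean training set.
    n_bad = int(ratio * len(normal))
    train = np.vstack([normal, attack_pool[:n_bad]]) if n_bad else normal
    det = IsolationForest(random_state=0).fit(train)
    X = np.vstack([normal[:500], attack_test])
    y = np.r_[np.zeros(500), np.ones(len(attack_test))]
    # score_samples: higher means more normal, so negate for an anomaly score.
    return roc_auc_score(y, -det.score_samples(X))

for r in (0.0, 0.05, 0.10):
    print(f"contamination {r:.0%}: AUC = {auc_with_contamination(r):.3f}")
```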
arXiv Detail & Related papers (2022-06-25T02:28:39Z)
- Practical Machine Learning Safety: A Survey and Primer [81.73857913779534]
Open-world deployment of Machine Learning algorithms in safety-critical applications such as autonomous vehicles needs to address a variety of ML vulnerabilities.
The survey covers new models and training techniques that reduce generalization error, achieve domain adaptation, and detect outlier examples and adversarial attacks.
Our organization maps state-of-the-art ML techniques to safety strategies in order to enhance the dependability of the ML algorithm from different aspects.
arXiv Detail & Related papers (2021-06-09T05:56:42Z)
- DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- Estimating Structural Target Functions using Machine Learning and Influence Functions [103.47897241856603]
We propose a new framework for statistical machine learning of target functions arising as identifiable functionals from statistical models.
This framework is problem- and model-agnostic and can be used to estimate a broad variety of target parameters of interest in applied statistics.
We put particular focus on so-called coarsening at random/doubly robust problems with partially unobserved information.
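The framework itself is abstract, but a well-known concrete instance of a doubly robust estimator is the AIPW estimate of a mean outcome under data missing at random, mu_hat = mean(m(X) + R * (Y - m(X)) / e(X)); a sketch with assumed nuisance models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
# Outcome is missing at random, with probability depending on X.
p_obs = 1 / (1 + np.exp(-(0.5 + X[:, 0])))
R = rng.random(n) < p_obs

# Nuisance models: outcome regression m(X) and observation propensity e(X).
m = LinearRegression().fit(X[R], Y[R])
e = LogisticRegression().fit(X, R.astype(int))

m_hat = m.predict(X)
e_hat = e.predict_proba(X)[:, 1]

# AIPW / doubly robust estimate of E[Y]: consistent if either nuisance
# model is correctly specified.
resid = np.where(R, Y - m_hat, 0.0)
mu_dr = np.mean(m_hat + R * resid / e_hat)
print(f"doubly robust E[Y] estimate: {mu_dr:.3f} (truth ~= 0)")
```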
arXiv Detail & Related papers (2020-08-14T16:48:29Z)