Mind the Gap: On Bridging the Semantic Gap between Machine Learning and
Information Security
- URL: http://arxiv.org/abs/2005.01800v1
- Date: Mon, 4 May 2020 19:19:32 GMT
- Title: Mind the Gap: On Bridging the Semantic Gap between Machine Learning and
Information Security
- Authors: Michael R. Smith, Nicholas T. Johnson, Joe B. Ingram, Armida J.
Carbajal, Ramyaa Ramyaa, Evelyn Domschot, Christopher C. Lamb, Stephen J.
Verzi, W. Philip Kegelmeyer
- Abstract summary: Despite the potential of machine learning (ML) to learn the behavior of malware, detect novel malware samples, and significantly improve information security, we see few, if any, high-impact ML techniques in deployed systems.
We hypothesize that the failure of ML to make a high impact in InfoSec is rooted in a disconnect between the two communities.
Specifically, current datasets and representations used by ML are not suitable for learning the behaviors of an executable.
- Score: 3.9629825964453986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the potential of machine learning (ML) to learn the behavior of
malware, detect novel malware samples, and significantly improve information
security (InfoSec), we see few, if any, high-impact ML techniques in deployed
systems, notwithstanding multiple reported successes in the open literature. We
hypothesize that the failure of ML to make a high impact in InfoSec is rooted
in a disconnect between the two communities, as evidenced by a semantic gap---a
difference in how executables are described (e.g., the data and the features
extracted from the data). Specifically, current datasets and representations
used by ML are not suitable for learning the behaviors of an executable and
differ significantly from those used by the InfoSec community. In this paper,
we survey existing datasets used for classifying malware by ML algorithms and
the features that are extracted from the data. We observe that: 1) the current
set of extracted features are primarily syntactic, not behavioral, 2) datasets
generally contain extreme exemplars producing a dataset in which it is easy to
discriminate classes, and 3) the datasets provide significantly different
representations of the data encountered in real-world systems. For ML to make
more of an impact in the InfoSec community, the data (including the features
and labels) must change to bridge the current semantic gap. As a first step in
enabling more behavioral analyses, we label existing
malware datasets with behavioral features using open-source threat reports
associated with malware families. This behavioral labeling alters the analysis
from identifying intent (e.g., good vs. bad) or malware family membership to an
analysis of which behaviors are exhibited by an executable. We offer the
annotations with the hope of inspiring future improvements in the data that
will further bridge the semantic gap between the ML and InfoSec communities.
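The family-to-behavior relabeling described in the abstract can be sketched in a few lines. Note this is only an illustrative sketch: the family names, behavior vocabulary, and mapping below are hypothetical placeholders, not the paper's actual annotations or threat-report contents.

```python
# Hypothetical sketch of behavioral labeling: re-label samples that carry
# only a malware-family tag with behavior tags distilled from (here,
# hand-written) open-source threat reports for each family.

# Behaviors attributed to each family, as might be extracted from threat reports.
FAMILY_BEHAVIORS = {
    "example_ransom": {"encrypts_files", "deletes_shadow_copies"},
    "example_stealer": {"keylogging", "exfiltrates_credentials"},
}

def behavioral_labels(samples):
    """Map (sample_id, family) pairs to multi-label behavior lists.

    This turns a family-classification dataset into a behavioral one:
    the target is no longer "which family?" but "which behaviors?".
    """
    return {
        sample_id: sorted(FAMILY_BEHAVIORS.get(family, set()))
        for sample_id, family in samples
    }

dataset = [("sample-001", "example_ransom"),
           ("sample-002", "example_stealer"),
           ("sample-003", "unknown_family")]
labels = behavioral_labels(dataset)
print(labels["sample-001"])  # ['deletes_shadow_copies', 'encrypts_files']
print(labels["sample-003"])  # [] -- no report coverage, no behavior labels
```

A sample whose family has no associated report simply receives an empty label set, which mirrors the coverage limits of report-derived annotation.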
Related papers
- Don't Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning [0.0]
This paper addresses a critical issue in Machine Learning (ML) where unintended information contaminates the training data, impacting model performance evaluation.
The discrepancy between evaluated and actual performance on new data is a significant concern.
It explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks.
arXiv Detail & Related papers (2024-01-24T20:30:52Z)
- Into the LAIONs Den: Investigating Hate in Multimodal Datasets [67.21783778038645]
This paper investigates the effect of scaling datasets on hateful content through a comparative audit of two datasets: LAION-400M and LAION-2B.
We found that hate content increased by nearly 12% with dataset scale, measured both qualitatively and quantitatively.
We also found that filtering dataset contents based on Not Safe For Work (NSFW) values calculated based on images alone does not exclude all the harmful content in alt-text.
arXiv Detail & Related papers (2023-11-06T19:00:05Z)
- EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis [48.5877840394508]
In recent years there has been a shift from heuristics-based malware detection towards machine learning.
We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER.
We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space.
arXiv Detail & Related papers (2023-10-03T06:58:45Z)
- Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far, with 67K samples from 670 families (100 samples each).
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z)
- SEAL: Interactive Tool for Systematic Error Analysis and Labeling [26.803598323167382]
This paper introduces an interactive Systematic Error Analysis and Labeling (seal) tool.
It uses a two-step approach to first identify high-error slices of data and then, in the second step, give human-understandable semantics to those underperforming slices.
We explore a variety of methods for coming up with coherent semantics for the error groups using language models for semantic labeling and a text-to-image model for generating visual features.
arXiv Detail & Related papers (2022-10-11T23:51:44Z)
- Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which explains the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z)
- S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning.
It is based on a biLSTM encoder and a fully-connected classifier to compute similarity.
Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
arXiv Detail & Related papers (2021-03-18T21:10:41Z)
- D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
- Meta Discovery: Learning to Discover Novel Classes given Very Limited Data [59.90813997957849]
In this paper, we analyze and improve L2DNC by linking it to meta-learning.
L2DNC is not only theoretically solvable, but also can be empirically solved by meta-learning algorithms slightly modified to fit our proposed framework.
arXiv Detail & Related papers (2021-02-08T04:53:14Z)
- DAEMON: Dataset-Agnostic Explainable Malware Classification Using Multi-Stage Feature Mining [3.04585143845864]
Malware classification is the task of determining to which family a new malicious variant belongs.
We present DAEMON, a novel dataset-agnostic malware classification tool.
arXiv Detail & Related papers (2020-08-04T21:57:30Z)
- Why an Android App is Classified as Malware? Towards Malware Classification Interpretation [34.59397128785141]
We propose a novel and interpretable ML-based approach (named XMal) to classify malware with high accuracy and explain the classification result.
XMal hinges on a multi-layer perceptron (MLP) and an attention mechanism, and pinpoints the key features most related to the classification result.
Our study peeks into interpretable ML through the research of Android malware detection and analysis.
arXiv Detail & Related papers (2020-04-24T03:05:09Z)
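The attention-over-features idea summarized in the XMal entry above can be illustrated with a minimal NumPy sketch. The feature names, dimensions, and random weights below are invented for illustration; this does not reproduce the paper's actual architecture or training.

```python
import numpy as np

# Minimal sketch of attention-weighted feature scoring in the spirit of
# XMal: an attention layer assigns a weight to each input feature, and
# those weights double as an explanation of the classification.
# All feature names and weights are invented for illustration.

rng = np.random.default_rng(0)
feature_names = ["SEND_SMS", "READ_CONTACTS", "getDeviceId", "INTERNET"]
x = np.array([1.0, 0.0, 1.0, 1.0])           # binary feature vector for one app

W_att = rng.normal(size=(4, 4))              # attention projection (untrained)
scores = W_att @ x
att = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
weighted = att * x                           # attended features

W_clf = rng.normal(size=4)                   # "MLP" reduced to one linear layer
logit = W_clf @ weighted
prob_malicious = 1.0 / (1.0 + np.exp(-logit))

# The attention weights pinpoint which features drove the decision.
for name, w in sorted(zip(feature_names, att), key=lambda p: -p[1]):
    print(f"{name}: {w:.3f}")
print(f"P(malicious) = {prob_malicious:.3f}")
```

Ranking features by their attention weight is what makes the output interpretable: the explanation falls out of the same computation that produces the score.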
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.