Quo Vadis: Hybrid Machine Learning Meta-Model based on Contextual and
Behavioral Malware Representations
- URL: http://arxiv.org/abs/2208.12248v1
- Date: Sat, 20 Aug 2022 05:30:16 GMT
- Title: Quo Vadis: Hybrid Machine Learning Meta-Model based on Contextual and
Behavioral Malware Representations
- Authors: Dmitrijs Trizna
- Abstract summary: We propose a hybrid machine learning architecture that simultaneously employs multiple deep learning models.
We report an improved detection rate, above the capabilities of the current state-of-the-art model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We propose a hybrid machine learning architecture that simultaneously employs
multiple deep learning models analyzing contextual and behavioral
characteristics of Windows portable executable, producing a final prediction
based on a decision from the meta-model. The detection heuristic in
contemporary machine learning Windows malware classifiers is typically based on
the static properties of the sample since dynamic analysis through
virtualization is challenging for vast quantities of samples. To surpass this
limitation, we employ a Windows kernel emulation that allows the acquisition of
behavioral patterns across large corpora with minimal temporal and
computational costs. We partner with a security vendor for a collection of more
than 100k int-the-wild samples that resemble the contemporary threat landscape,
containing raw PE files and filepaths of applications at the moment of
execution. The acquired dataset is at least ten folds larger than reported in
related works on behavioral malware analysis. Files in the training dataset are
labeled by a professional threat intelligence team, utilizing manual and
automated reverse engineering tools. We estimate the hybrid classifier's
operational utility by collecting an out-of-sample test set three months later
from the acquisition of the training set. We report an improved detection rate,
above the capabilities of the current state-of-the-art model, especially under
low false-positive requirements. Additionally, we uncover a meta-model's
ability to identify malicious activity in validation and test sets even if none
of the individual models express enough confidence to mark the sample as
malevolent. We conclude that the meta-model can learn patterns typical to
malicious samples from representation combinations produced by different
analysis techniques. We publicly release pre-trained models and anonymized
dataset of emulation reports.
Related papers
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - IoTGeM: Generalizable Models for Behaviour-Based IoT Attack Detection [3.3772986620114387]
We present an approach for modelling IoT network attacks that focuses on generalizability, yet also leads to better detection and performance.
First, we present an improved rolling window approach for feature extraction, and introduce a multi-step feature selection process that reduces overfitting.
Second, we build and test models using isolated train and test datasets, thereby avoiding common data leaks.
Third, we rigorously evaluate our methodology using a diverse portfolio of machine learning models, evaluation metrics and datasets.
arXiv Detail & Related papers (2023-10-17T21:46:43Z) - Decoding the Secrets of Machine Learning in Malware Classification: A
Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far with 67K samples from 670 families (100 samples each)
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z) - Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection
Capability [70.72426887518517]
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications.
We propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data.
Our method utilizes a mask to figure out the memorized atypical samples, and then finetune the model or prune it with the introduced mask to forget them.
arXiv Detail & Related papers (2023-06-06T14:23:34Z) - Provable Robustness for Streaming Models with a Sliding Window [51.85182389861261]
In deep learning applications such as online content recommendation and stock market analysis, models use historical data to make predictions.
We derive robustness certificates for models that use a fixed-size sliding window over the input stream.
Our guarantees hold for the average model performance across the entire stream and are independent of stream size, making them suitable for large data streams.
arXiv Detail & Related papers (2023-03-28T21:02:35Z) - Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluate deep learning models is to build a labeled test set with attributes of interest and assess how well it performs.
This paper argues the case that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set nor labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z) - CARLA-GeAR: a Dataset Generator for a Systematic Evaluation of
Adversarial Robustness of Vision Models [61.68061613161187]
This paper presents CARLA-GeAR, a tool for the automatic generation of synthetic datasets for evaluating the robustness of neural models against physical adversarial patches.
The tool is built on the CARLA simulator, using its Python API, and allows the generation of datasets for several vision tasks in the context of autonomous driving.
The paper presents an experimental study to evaluate the performance of some defense methods against such attacks, showing how the datasets generated with CARLA-GeAR might be used in future work as a benchmark for adversarial defense in the real world.
arXiv Detail & Related papers (2022-06-09T09:17:38Z) - Autoencoder Attractors for Uncertainty Estimation [13.618797548020462]
We propose a novel approach for uncertainty estimation based on autoencoder models.
We evaluate our approach on several dataset combinations as well as on an industrial application for occupant classification in the vehicle interior.
arXiv Detail & Related papers (2022-04-01T12:10:06Z) - Towards an Automated Pipeline for Detecting and Classifying Malware
through Machine Learning [0.0]
We propose a malware taxonomic classification pipeline able to classify Windows Portable Executable files (PEs)
Given an input PE sample, it is first classified as either malicious or benign.
If malicious, the pipeline further analyzes it in order to establish its threat type, family, and behavior(s)
arXiv Detail & Related papers (2021-06-10T10:07:50Z) - Firearm Detection via Convolutional Neural Networks: Comparing a
Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents.
One way for achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis.
We compare a traditional monolithic end-to-end deep learning model and a previously proposed model based on an ensemble of simpler neural networks detecting fire-weapons via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.