Multifamily Malware Models
- URL: http://arxiv.org/abs/2207.00620v1
- Date: Mon, 27 Jun 2022 13:06:31 GMT
- Title: Multifamily Malware Models
- Authors: Samanvitha Basole and Fabio Di Troia and Mark Stamp
- Abstract summary: We conduct experiments based on byte $n$-gram features to quantify the relationship between the generality of the training dataset and the accuracy of the corresponding machine learning models.
We find that neighborhood-based algorithms generalize surprisingly well, far outperforming the other machine learning techniques considered.
- Score: 5.414308305392762
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When training a machine learning model, there is likely to be a tradeoff
between accuracy and the diversity of the dataset. Previous research has shown
that if we train a model to detect one specific malware family, we generally
obtain stronger results as compared to a case where we train a single model on
multiple diverse families. However, during the detection phase, it would be
more efficient to have a single model that can reliably detect multiple
families, rather than having to score each sample against multiple models. In
this research, we conduct experiments based on byte $n$-gram features to
quantify the relationship between the generality of the training dataset and
the accuracy of the corresponding machine learning models, all within the
context of the malware detection problem. We find that neighborhood-based
algorithms generalize surprisingly well, far outperforming the other machine
learning techniques considered.
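To make the setup concrete, below is a minimal sketch of the pipeline the abstract describes: extract byte $n$-gram counts from binary samples and score them with a neighborhood-based ($k$-nearest-neighbor) classifier. The file paths, the choice of $n$, and the value of $k$ are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch: byte n-gram features + k-NN malware classifier.
# Paths, n, and k are placeholders, not the paper's exact setup.
from collections import Counter
from pathlib import Path

from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


def byte_ngrams(path: Path, n: int = 2) -> Counter:
    """Count overlapping byte n-grams (as hex strings) in a binary file."""
    data = path.read_bytes()
    return Counter(data[i:i + n].hex() for i in range(len(data) - n + 1))


def build_dataset(samples, n: int = 2):
    """samples: (path_to_binary, label) pairs, e.g. 1 = malware, 0 = benign."""
    features = [byte_ngrams(path, n) for path, _ in samples]
    labels = [label for _, label in samples]
    return features, labels


# Neighborhood-based classifier over raw n-gram counts; DictVectorizer maps
# each distinct n-gram to its own column.
model = make_pipeline(DictVectorizer(), KNeighborsClassifier(n_neighbors=5))

# Hypothetical usage:
# X_train, y_train = build_dataset(train_samples)
# model.fit(X_train, y_train)
# X_test, y_test = build_dataset(test_samples)
# print(model.score(X_test, y_test))
```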
Related papers
- Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often constrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
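As one illustration of what a harder split can look like in the malware setting, the sketch below holds entire families out of training (a group-disjoint split via scikit-learn's GroupShuffleSplit). This is only a plausible example of increasing split difficulty, not the benchmark-construction procedure of the cited paper; the data, labels, and family assignments are synthetic placeholders.

```python
# Illustration: family-disjoint (group-disjoint) train/test split, which is
# harder than a uniform random split. Not the cited paper's exact procedure.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Toy stand-ins: 1000 samples, 20 synthetic feature columns, 10 families.
X = rng.random((1000, 20))
y = rng.integers(0, 2, size=1000)          # 1 = malware, 0 = benign
families = rng.integers(0, 10, size=1000)  # family label per sample

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=families))

# No family appears on both sides of the split, so a model cannot succeed by
# memorising family-specific byte patterns seen during training.
assert set(families[train_idx]).isdisjoint(families[test_idx])
```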
arXiv Detail & Related papers (2023-12-25T21:25:55Z)
- Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z)
- Decoding the Secrets of Machine Learning in Malware Classification: A Deep Dive into Datasets, Feature Extraction, and Model Performance [25.184668510417545]
We collect the largest balanced malware dataset so far, with 67K samples from 670 families (100 samples each).
We train state-of-the-art models for malware detection and family classification using our dataset.
Our results reveal that static features perform better than dynamic features, and that combining both only provides marginal improvement over static features.
arXiv Detail & Related papers (2023-07-27T07:18:10Z)
- Many tasks make light work: Learning to localise medical anomalies from multiple synthetic tasks [2.912977051718473]
There is growing interest in single-class modelling and out-of-distribution detection.
Fully supervised machine learning models cannot reliably identify classes not included in their training data.
We make use of multiple visually-distinct synthetic anomaly learning tasks for both training and validation.
arXiv Detail & Related papers (2023-07-03T09:52:54Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
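For intuition, the simplest form of merging models in their parameter space is element-wise weight averaging of models that share an architecture, sketched below. The cited paper's merging rule is more refined than plain averaging, so treat this as an illustration of the general idea rather than the paper's algorithm; the models here are hypothetical toy networks.

```python
# Sketch: parameter-space model merging via element-wise weight averaging.
# Only illustrates the general idea of fusing models without training data.
import torch
import torch.nn as nn


def average_state_dicts(state_dicts):
    """Return the element-wise mean of a list of compatible state dicts."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged


# Two hypothetical fine-tuned copies of the same small network.
model_a = nn.Linear(16, 2)
model_b = nn.Linear(16, 2)

# Merge their parameters into a single model of the same architecture.
merged_model = nn.Linear(16, 2)
merged_model.load_state_dict(average_state_dicts([model_a.state_dict(), model_b.state_dict()]))
```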
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- QuantifyML: How Good is my Machine Learning Model? [0.0]
QuantifyML aims to quantify the extent to which machine learning models have learned and generalized from the given data.
The trained model is encoded as a logical formula, which is analyzed with off-the-shelf model counters to obtain precise counts with respect to different model behaviors.
arXiv Detail & Related papers (2021-10-25T01:56:01Z)
- Model-based micro-data reinforcement learning: what are the crucial model properties and which model to choose? [0.2836066255205732]
We contribute to micro-data model-based reinforcement learning (MBRL) by rigorously comparing popular generative models.
We find that on an environment that requires multimodal posterior predictives, mixture density nets outperform all other models by a large margin.
We also find that deterministic models are on par; in fact, they consistently (although not significantly) outperform their probabilistic counterparts.
arXiv Detail & Related papers (2021-07-24T11:38:25Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model under test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.