MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels
- URL: http://arxiv.org/abs/2111.15031v1
- Date: Mon, 29 Nov 2021 23:59:50 GMT
- Authors: Robert J. Joyce, Dev Amlani, Charles Nicholas, Edward Raff
- Abstract summary: We have created the Malware Open-source Threat Intelligence Family (MOTIF) dataset.
MOTIF contains 3,095 malware samples from 454 families, making it the largest and most diverse public malware dataset.
We provide aliases of the different names used to describe the same malware family, allowing us to benchmark, for the first time, the accuracy of existing tools.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Malware family classification is a significant issue with public safety and
research implications that has been hindered by the high cost of expert labels.
The vast majority of corpora use noisy labeling approaches that obstruct
definitive quantification of results and study of deeper interactions. In order
to provide the data needed to advance further, we have created the Malware
Open-source Threat Intelligence Family (MOTIF) dataset. MOTIF contains 3,095
malware samples from 454 families, making it the largest and most diverse
public malware dataset with ground truth family labels to date, nearly 3x
larger than any prior expert-labeled corpus and 36x larger than the prior
Windows malware corpus. MOTIF also comes with a mapping from malware samples to
threat reports published by reputable industry sources, which both validates
the labels and opens new research opportunities in connecting opaque malware
samples to human-readable descriptions. This enables important evaluations that
are normally infeasible due to non-standardized reporting in industry. For
example, we provide aliases of the different names used to describe the same
malware family, allowing us to benchmark, for the first time, the accuracy of
existing tools when names are obtained from differing sources. Evaluation
results obtained using the MOTIF dataset indicate that existing tasks have
significant room for improvement, with accuracy of antivirus majority voting
measured at only 62.10% and the well-known AVClass tool having just 46.78%
accuracy. Our findings indicate that malware family classification suffers
from a type of labeling noise unlike that studied in most ML literature, due
to the large open set of classes that may not be known from the sample under
consideration.
Related papers
- Multi-label Classification for Android Malware Based on Active Learning
We propose MLCDroid, an ML-based multi-label classification approach that can directly indicate the existence of pre-defined malicious behaviors.
We compare the results of 70 algorithm combinations to evaluate their effectiveness; the best achieves 73.3%.
This is the first multi-label Android malware classification approach intending to provide more information on fine-grained malicious behaviors.
arXiv Detail & Related papers (2024-10-09T01:09:24Z)
- MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers
Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files and classifying malware by family.
We have identified four tasks which are under-represented in prior work: classification by behaviors that malware exhibit, platforms that malware run on, vulnerabilities that malware exploit, and packers that malware are packed with.
We are releasing benchmark datasets for each of these four classification tasks, tagged using ClarAVy and comprising nearly 5.5 million malicious files in total.
arXiv Detail & Related papers (2023-10-18T04:36:26Z)
- EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis
In recent years, there has been a shift from quantification-based malware detection towards machine learning.
We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER.
We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space.
arXiv Detail & Related papers (2023-10-03T06:58:45Z)
- CNS-Net: Conservative Novelty Synthesizing Network for Malware Recognition in an Open-set Scenario
We study the challenging task of malware recognition on both known and novel, previously unseen malware families, called malware open-set recognition (MOSR).
In this paper, we propose a novel model that can conservatively synthesize malware instances to mimic unknown malware families.
We also build a new large-scale malware dataset, named MAL-100, to fill the gap left by the lack of a large open-set malware benchmark dataset.
arXiv Detail & Related papers (2023-05-02T07:31:42Z)
- DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness
We develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection.
Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables.
We are the first to offer certified robustness in the realm of static detection of malware executables.
arXiv Detail & Related papers (2023-03-20T17:25:22Z)
- Towards a Fair Comparison and Realistic Design and Evaluation Framework of Android Malware Detectors
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z)
- Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection
There is a lack of scientific testing of commercially available malware detectors.
We present a scientific evaluation of four market-leading malware detection tools.
Our results show that all four tools have near-perfect precision but alarmingly low recall.
arXiv Detail & Related papers (2020-12-16T19:10:00Z)
- Being Single Has Benefits. Instance Poisoning to Deceive Malware Classifiers
We show how an attacker can launch a sophisticated and efficient poisoning attack targeting the dataset used to train a malware classifier.
As opposed to other poisoning attacks in the malware detection domain, our attack does not focus on malware families but rather on specific malware instances that contain an implanted trigger.
We propose a comprehensive detection approach that could serve as a future sophisticated defense against this newly discovered severe threat.
arXiv Detail & Related papers (2020-10-30T15:27:44Z)
- DAEMON: Dataset-Agnostic Explainable Malware Classification Using Multi-Stage Feature Mining
Malware classification is the task of determining to which family a new malicious variant belongs.
We present DAEMON, a novel dataset-agnostic malware classification tool.
arXiv Detail & Related papers (2020-08-04T21:57:30Z)
- Maat: Automatically Analyzing VirusTotal for Accurate Labeling and Effective Malware Detection
The malware analysis and detection research community relies on the online platform VirusTotal to label Android apps based on the scan results of around 60 scanners.
There are no standards on how to best interpret the scan results acquired from VirusTotal, which leads to the utilization of different threshold-based labeling strategies.
We implemented a method, Maat, that tackles these issues of standardization and sustainability by automatically generating a Machine Learning (ML)-based labeling scheme.
arXiv Detail & Related papers (2020-07-01T14:15:03Z)
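The threshold-based labeling strategies that Maat aims to standardize away can be sketched as follows. The cutoffs here are illustrative assumptions, not values used by Maat or VirusTotal.

```python
def label_by_count(positives: int, threshold: int = 4) -> str:
    """Fixed-count strategy: malicious if at least `threshold` engines flag it."""
    return "malicious" if positives >= threshold else "benign"

def label_by_ratio(positives: int, total: int, ratio: float = 0.1) -> str:
    """Ratio strategy: malicious if the fraction of positive engines is high enough."""
    return "malicious" if total > 0 and positives / total >= ratio else "benign"

# A sample flagged by 4 of 60 engines is labeled differently by the two strategies:
print(label_by_count(4))      # -> malicious (4 >= 4)
print(label_by_ratio(4, 60))  # -> benign (4/60 ~ 6.7% < 10%)
```

That the same scan report yields different labels under different hand-picked thresholds is exactly the standardization problem described above; Maat instead learns a labeling scheme from VirusTotal scan results automatically.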
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.