New Datasets for Dynamic Malware Classification
- URL: http://arxiv.org/abs/2111.15205v1
- Date: Tue, 30 Nov 2021 08:31:16 GMT
- Title: New Datasets for Dynamic Malware Classification
- Authors: Berkant D\"uzg\"un, Aykut \c{C}ay{\i}r, Ferhat Demirk{\i}ran, Ceyda
Nur Kayha, Buket Gen\c{c}ayd{\i}n and Hasan Da\u{g}
- Abstract summary: We introduce two new, updated datasets of malicious software, VirusSamples and VirusShare.
This paper analyzes multi-class malware classification performance of the balanced and imbalanced version of these two datasets.
Results show that Support Vector Machine, achieves the highest score of 94% in the imbalanced VirusSample dataset.
XGBoost, one of the most common gradient boosting-based models, achieves the highest score of 90% and 80%.in both versions of the VirusShare dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, malware and malware incidents are increasing daily, even with
various anti-viruses systems and malware detection or classification
methodologies. Many static, dynamic, and hybrid techniques have been presented
to detect malware and classify them into malware families. Dynamic and hybrid
malware classification methods have advantages over static malware
classification methods by being highly efficient. Since it is difficult to mask
malware behavior while executing than its underlying code in static malware
classification, machine learning techniques have been the main focus of the
security experts to detect malware and determine their families dynamically.
The rapid increase of malware also brings the necessity of recent and updated
datasets of malicious software. We introduce two new, updated datasets in this
work: One with 9,795 samples obtained and compiled from VirusSamples and the
one with 14,616 samples from VirusShare. This paper also analyzes multi-class
malware classification performance of the balanced and imbalanced version of
these two datasets by using Histogram-based gradient boosting, Random Forest,
Support Vector Machine, and XGBoost models with API call-based dynamic malware
classification. Results show that Support Vector Machine, achieves the highest
score of 94% in the imbalanced VirusSample dataset, whereas the same model has
91% accuracy in the balanced VirusSample dataset. While XGBoost, one of the
most common gradient boosting-based models, achieves the highest score of 90%
and 80%.in both versions of the VirusShare dataset. This paper also presents
the baseline results of VirusShare and VirusSample datasets by using the four
most widely known machine learning techniques in dynamic malware classification
literature. We believe that these two datasets and baseline results enable
researchers in this field to test and validate their methods and approaches.
Related papers
- Multi-label Classification for Android Malware Based on Active Learning [7.599125552187342]
We propose MLCDroid, an ML-based multi-label classification approach that can directly indicate the existence of pre-defined malicious behaviors.
We compare the results of 70 algorithm combinations to evaluate the effectiveness (best at 73.3%).
This is the first multi-label Android malware classification approach intending to provide more information on fine-grained malicious behaviors.
arXiv Detail & Related papers (2024-10-09T01:09:24Z) - Online Clustering of Known and Emerging Malware Families [1.2289361708127875]
It is essential to categorize malware samples according to their malicious characteristics.
Online clustering algorithms help us to understand malware behavior and produce a quicker response to new threats.
This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families.
arXiv Detail & Related papers (2024-05-06T09:20:17Z) - MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers [44.700094741798445]
Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files and classifying malware by family.
We have identified four tasks which are under-represented in prior work: classification by behaviors that malware exhibit, platforms that malware run on, vulnerabilities that malware exploit, and packers that malware are packed with.
We are releasing benchmark datasets for each of these four classification tasks, tagged using ClarAVy and comprising nearly 5.5 million malicious files in total.
arXiv Detail & Related papers (2023-10-18T04:36:26Z) - EMBERSim: A Large-Scale Databank for Boosting Similarity Search in
Malware Analysis [48.5877840394508]
In recent years there has been a shift from quantifications-based malware detection towards machine learning.
We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER.
We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space.
arXiv Detail & Related papers (2023-10-03T06:58:45Z) - DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified
Robustness [58.23214712926585]
We develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection.
Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables.
We are the first to offer certified robustness in the realm of static detection of malware executables.
arXiv Detail & Related papers (2023-03-20T17:25:22Z) - Behavioural Reports of Multi-Stage Malware [3.64414368529873]
This dataset provides API call sequences for thousands of malware samples executed in Windows 10 virtual machines.
A tutorial on how to create and expand this dataset is provided along with a benchmark demonstrating how to use this dataset to classify malware.
arXiv Detail & Related papers (2023-01-30T11:51:02Z) - Using Static and Dynamic Malware features to perform Malware Ascription [0.0]
We employ various Static and Dynamic features of malicious executables to classify malware based on their family.
We leverage Cuckoo Sandbox and machine learning to make progress in this research.
arXiv Detail & Related papers (2021-12-05T18:01:09Z) - Being Single Has Benefits. Instance Poisoning to Deceive Malware
Classifiers [47.828297621738265]
We show how an attacker can launch a sophisticated and efficient poisoning attack targeting the dataset used to train a malware classifier.
As opposed to other poisoning attacks in the malware detection domain, our attack does not focus on malware families but rather on specific malware instances that contain an implanted trigger.
We propose a comprehensive detection approach that could serve as a future sophisticated defense against this newly discovered severe threat.
arXiv Detail & Related papers (2020-10-30T15:27:44Z) - Adversarial EXEmples: A Survey and Experimental Evaluation of Practical
Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes.
We develop a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks.
These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section.
arXiv Detail & Related papers (2020-08-17T07:16:57Z) - DAEMON: Dataset-Agnostic Explainable Malware Classification Using
Multi-Stage Feature Mining [3.04585143845864]
Malware classification is the task of determining to which family a new malicious variant belongs.
We present DAEMON, a novel dataset-agnostic malware classification tool.
arXiv Detail & Related papers (2020-08-04T21:57:30Z) - Maat: Automatically Analyzing VirusTotal for Accurate Labeling and
Effective Malware Detection [71.84087757644708]
The malware analysis and detection research community relies on the online platform VirusTotal to label Android apps based on the scan results of around 60 scanners.
There are no standards on how to best interpret the scan results acquired from VirusTotal, which leads to the utilization of different threshold-based labeling strategies.
We implemented a method, Maat, that tackles these issues of standardization and sustainability by automatically generating a Machine Learning (ML)-based labeling scheme.
arXiv Detail & Related papers (2020-07-01T14:15:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.