Related papers: New Datasets for Dynamic Malware Classification

New Datasets for Dynamic Malware Classification

URL: http://arxiv.org/abs/2111.15205v1
Date: Tue, 30 Nov 2021 08:31:16 GMT
Title: New Datasets for Dynamic Malware Classification
Authors: Berkant D\"uzg\"un, Aykut \c{C}ay{\i}r, Ferhat Demirk{\i}ran, Ceyda Nur Kayha, Buket Gen\c{c}ayd{\i}n and Hasan Da\u{g}
Abstract summary: We introduce two new, updated datasets of malicious software, VirusSamples and VirusShare. This paper analyzes multi-class malware classification performance of the balanced and imbalanced version of these two datasets. Results show that Support Vector Machine, achieves the highest score of 94% in the imbalanced VirusSample dataset. XGBoost, one of the most common gradient boosting-based models, achieves the highest score of 90% and 80%.in both versions of the VirusShare dataset.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Nowadays, malware and malware incidents are increasing daily, even with various anti-viruses systems and malware detection or classification methodologies. Many static, dynamic, and hybrid techniques have been presented to detect malware and classify them into malware families. Dynamic and hybrid malware classification methods have advantages over static malware classification methods by being highly efficient. Since it is difficult to mask malware behavior while executing than its underlying code in static malware classification, machine learning techniques have been the main focus of the security experts to detect malware and determine their families dynamically. The rapid increase of malware also brings the necessity of recent and updated datasets of malicious software. We introduce two new, updated datasets in this work: One with 9,795 samples obtained and compiled from VirusSamples and the one with 14,616 samples from VirusShare. This paper also analyzes multi-class malware classification performance of the balanced and imbalanced version of these two datasets by using Histogram-based gradient boosting, Random Forest, Support Vector Machine, and XGBoost models with API call-based dynamic malware classification. Results show that Support Vector Machine, achieves the highest score of 94% in the imbalanced VirusSample dataset, whereas the same model has 91% accuracy in the balanced VirusSample dataset. While XGBoost, one of the most common gradient boosting-based models, achieves the highest score of 90% and 80%.in both versions of the VirusShare dataset. This paper also presents the baseline results of VirusShare and VirusSample datasets by using the four most widely known machine learning techniques in dynamic malware classification literature. We believe that these two datasets and baseline results enable researchers in this field to test and validate their methods and approaches.

Related papers

Multi-label Classification for Android Malware Based on Active Learning [7.599125552187342]
We propose MLCDroid, an ML-based multi-label classification approach that can directly indicate the existence of pre-defined malicious behaviors. We compare the results of 70 algorithm combinations to evaluate the effectiveness (best at 73.3%). This is the first multi-label Android malware classification approach intending to provide more information on fine-grained malicious behaviors.
arXiv Detail & Related papers (2024-10-09T01:09:24Z)
Online Clustering of Known and Emerging Malware Families [1.2289361708127875]
It is essential to categorize malware samples according to their malicious characteristics. Online clustering algorithms help us to understand malware behavior and produce a quicker response to new threats. This paper introduces a novel machine learning-based model for the online clustering of malicious samples into malware families.
arXiv Detail & Related papers (2024-05-06T09:20:17Z)
MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers [44.700094741798445]
Existing research on malware classification focuses almost exclusively on two tasks: distinguishing between malicious and benign files and classifying malware by family. We have identified four tasks which are under-represented in prior work: classification by behaviors that malware exhibit, platforms that malware run on, vulnerabilities that malware exploit, and packers that malware are packed with. We are releasing benchmark datasets for each of these four classification tasks, tagged using ClarAVy and comprising nearly 5.5 million malicious files in total.
arXiv Detail & Related papers (2023-10-18T04:36:26Z)
EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis [48.5877840394508]
In recent years there has been a shift from quantifications-based malware detection towards machine learning. We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER. We enhance EMBER with similarity information as well as malware class tags, to enable further research in the similarity space.
arXiv Detail & Related papers (2023-10-03T06:58:45Z)
DRSM: De-Randomized Smoothing on Malware Classifier Providing Certified Robustness [58.23214712926585]
We develop a certified defense, DRSM (De-Randomized Smoothed MalConv), by redesigning the de-randomized smoothing technique for the domain of malware detection. Specifically, we propose a window ablation scheme to provably limit the impact of adversarial bytes while maximally preserving local structures of the executables. We are the first to offer certified robustness in the realm of static detection of malware executables.
arXiv Detail & Related papers (2023-03-20T17:25:22Z)
Behavioural Reports of Multi-Stage Malware [3.64414368529873]
This dataset provides API call sequences for thousands of malware samples executed in Windows 10 virtual machines. A tutorial on how to create and expand this dataset is provided along with a benchmark demonstrating how to use this dataset to classify malware.
arXiv Detail & Related papers (2023-01-30T11:51:02Z)
Using Static and Dynamic Malware features to perform Malware Ascription [0.0]
We employ various Static and Dynamic features of malicious executables to classify malware based on their family. We leverage Cuckoo Sandbox and machine learning to make progress in this research.
arXiv Detail & Related papers (2021-12-05T18:01:09Z)
Being Single Has Benefits. Instance Poisoning to Deceive Malware Classifiers [47.828297621738265]
We show how an attacker can launch a sophisticated and efficient poisoning attack targeting the dataset used to train a malware classifier. As opposed to other poisoning attacks in the malware detection domain, our attack does not focus on malware families but rather on specific malware instances that contain an implanted trigger. We propose a comprehensive detection approach that could serve as a future sophisticated defense against this newly discovered severe threat.
arXiv Detail & Related papers (2020-10-30T15:27:44Z)
Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes. We develop a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks. These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section.
arXiv Detail & Related papers (2020-08-17T07:16:57Z)
DAEMON: Dataset-Agnostic Explainable Malware Classification Using Multi-Stage Feature Mining [3.04585143845864]
Malware classification is the task of determining to which family a new malicious variant belongs. We present DAEMON, a novel dataset-agnostic malware classification tool.
arXiv Detail & Related papers (2020-08-04T21:57:30Z)
Maat: Automatically Analyzing VirusTotal for Accurate Labeling and Effective Malware Detection [71.84087757644708]
The malware analysis and detection research community relies on the online platform VirusTotal to label Android apps based on the scan results of around 60 scanners. There are no standards on how to best interpret the scan results acquired from VirusTotal, which leads to the utilization of different threshold-based labeling strategies. We implemented a method, Maat, that tackles these issues of standardization and sustainability by automatically generating a Machine Learning (ML)-based labeling scheme.
arXiv Detail & Related papers (2020-07-01T14:15:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.