MLRan: A Behavioural Dataset for Ransomware Analysis and Detection
- URL: http://arxiv.org/abs/2505.18613v1
- Date: Sat, 24 May 2025 09:22:53 GMT
- Title: MLRan: A Behavioural Dataset for Ransomware Analysis and Detection
- Authors: Faithful Chiagoziem Onwuegbuche, Adelodun Olaoluwa, Anca Delia Jurcut, Liliana Pasquale,
- Abstract summary: We introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples.<n>The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants.<n>We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan.
- Score: 0.7706236363202722
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ransomware remains a critical threat to cybersecurity, yet publicly available datasets for training machine learning-based ransomware detection models are scarce and often have limited sample size, diversity, and reproducibility. In this paper, we introduce MLRan, a behavioural ransomware dataset, comprising over 4,800 samples across 64 ransomware families and a balanced set of goodware samples. The samples span from 2006 to 2024 and encompass the four major types of ransomware: locker, crypto, ransomware-as-a-service, and modern variants. We also propose guidelines (GUIDE-MLRan), inspired by previous work, for constructing high-quality behavioural ransomware datasets, which informed the curation of our dataset. We evaluated the ransomware detection performance of several machine learning (ML) models using MLRan. For this purpose, we performed feature selection by conducting mutual information filtering to reduce the initial 6.4 million features to 24,162, followed by recursive feature elimination, yielding 483 highly informative features. The ML models achieved an accuracy, precision and recall of up to 98.7%, 98.9%, 98.5%, respectively. Using SHAP and LIME, we identified critical indicators of malicious behaviour, including registry tampering, strings, and API misuse. The dataset and source code for feature extraction, selection, ML training, and evaluation are available publicly to support replicability and encourage future research, which can be found at https://github.com/faithfulco/mlran.
Related papers
- Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG [17.816068374958043]
Family-Specific String (FSS) features could be utilized in a manner similar to Retrieval-Augmented Generation (RAG) to facilitate Malware Family Classification (MFC)<n>We develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.
arXiv Detail & Related papers (2025-07-05T14:36:13Z) - A Sysmon Incremental Learning System for Ransomware Analysis and Detection [1.495391051525033]
In the face of increasing cyber threats, particularly ransomware attacks, there is a pressing need for advanced detection and analysis systems.<n>Most of these proposals leverage non-incremental learning approaches that require the underlying models to be updated from scratch to detect new ransomware.<n>This approach is problematic because it leaves sensitive data vulnerable to attack during retraining, as newly emerging ransomware strains may go undetected until the model is updated.<n>We present the Sysmon Incremental Learning System for Analysis and Detection (SILRAD), which enables continuous updates to the underlying model and effectively closes the training gap.
arXiv Detail & Related papers (2025-01-02T06:22:58Z) - Zero-day attack and ransomware detection [0.0]
This study uses the UGRansome dataset to train various Machine Learning models for zero-day and ransomware attacks detection.
The finding demonstrates that Random Forest (RFC), XGBoost, and Ensemble Methods achieved perfect scores in accuracy, precision, recall, and F1-score.
arXiv Detail & Related papers (2024-08-08T02:23:42Z) - Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits! [51.668411293817464]
Industry practitioners care about small improvements in malware detection accuracy because their models are deployed to hundreds of millions of machines.
Academic research is often restrained to public datasets on the order of ten thousand samples.
We devise an approach to generate a benchmark of difficulty from a pool of available samples.
arXiv Detail & Related papers (2023-12-25T21:25:55Z) - Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Unleashing Mask: Explore the Intrinsic Out-of-Distribution Detection
Capability [70.72426887518517]
Out-of-distribution (OOD) detection is an indispensable aspect of secure AI when deploying machine learning models in real-world applications.
We propose a novel method, Unleashing Mask, which aims to restore the OOD discriminative capabilities of the well-trained model with ID data.
Our method utilizes a mask to figure out the memorized atypical samples, and then finetune the model or prune it with the introduced mask to forget them.
arXiv Detail & Related papers (2023-06-06T14:23:34Z) - Behavioural Reports of Multi-Stage Malware [3.64414368529873]
This dataset provides API call sequences for thousands of malware samples executed in Windows 10 virtual machines.
A tutorial on how to create and expand this dataset is provided along with a benchmark demonstrating how to use this dataset to classify malware.
arXiv Detail & Related papers (2023-01-30T11:51:02Z) - Interpretable Machine Learning for Detection and Classification of
Ransomware Families Based on API Calls [5.340730281227837]
This research work utilizes the frequencies of different API calls to detect and classify ransomware families.
A WebCrawler is developed to automate collecting the Windows Portable Executable PE files of 15 different ransomware families.
Logistic Regression can efficiently classify ransomware into their corresponding families securing 9915 accuracy.
arXiv Detail & Related papers (2022-10-16T15:54:45Z) - MOVE: Effective and Harmless Ownership Verification via Embedded External Features [104.97541464349581]
We propose an effective and harmless model ownership verification (MOVE) to defend against different types of model stealing simultaneously.<n>We conduct the ownership verification by verifying whether a suspicious model contains the knowledge of defender-specified external features.<n>We then train a meta-classifier to determine whether a model is stolen from the victim.
arXiv Detail & Related papers (2022-08-04T02:22:29Z) - Towards a Fair Comparison and Realistic Design and Evaluation Framework
of Android Malware Detectors [63.75363908696257]
We analyze 10 influential research works on Android malware detection using a common evaluation framework.
We identify five factors that, if not taken into account when creating datasets and designing detectors, significantly affect the trained ML models.
We conclude that the studied ML-based detectors have been evaluated optimistically, which justifies the good published results.
arXiv Detail & Related papers (2022-05-25T08:28:08Z) - Being Single Has Benefits. Instance Poisoning to Deceive Malware
Classifiers [47.828297621738265]
We show how an attacker can launch a sophisticated and efficient poisoning attack targeting the dataset used to train a malware classifier.
As opposed to other poisoning attacks in the malware detection domain, our attack does not focus on malware families but rather on specific malware instances that contain an implanted trigger.
We propose a comprehensive detection approach that could serve as a future sophisticated defense against this newly discovered severe threat.
arXiv Detail & Related papers (2020-10-30T15:27:44Z) - Towards a Resilient Machine Learning Classifier -- a Case Study of
Ransomware Detection [5.560986338397972]
A machine learning (ML) classifier was built to detect ransomware (called crypto-ransomware)
We find that input/output activities of ransomware and the file-content entropy are unique traits to detect crypto-ransomware.
In addition to accuracy and resiliency, trustworthiness is the other key criteria for a quality detector.
arXiv Detail & Related papers (2020-03-13T18:02:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.