On the impact of dataset size and class imbalance in evaluating
machine-learning-based windows malware detection techniques
- URL: http://arxiv.org/abs/2206.06256v1
- Date: Mon, 13 Jun 2022 15:37:31 GMT
- Title: On the impact of dataset size and class imbalance in evaluating
machine-learning-based windows malware detection techniques
- Authors: David Illes
- Abstract summary: Some researchers use smaller datasets, and if dataset size has a significant impact on performance, that makes comparison of the published results difficult.
The project identified two key objectives: to understand whether dataset size correlates with measured detector performance to an extent that prevents meaningful comparison of published results, and whether good published performance translates to real-world deployment.
Results suggested that high accuracy scores don't necessarily translate to high real-world performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The purpose of this project was to collect and analyse data about the
comparability and real-life applicability of published results focusing on
Microsoft Windows malware, more specifically the impact of dataset size and
testing dataset imbalance on measured detector performance. Some researchers
use smaller datasets, and if dataset size has a significant impact on
performance, that makes comparison of the published results difficult.
Researchers also tend to use balanced datasets and accuracy as a metric for
testing. The former is not a true representation of reality, where benign
samples significantly outnumber malware, and the latter approach is known to be
problematic for imbalanced problems. The project identified two key objectives:
to understand whether dataset size correlates with measured detector performance
to an extent that prevents meaningful comparison of published results, and to
understand whether detectors that report good performance in published research
can be expected to perform well in a real-world deployment scenario. The results
suggested that dataset size does correlate with measured detector performance to
an extent that prevents meaningful comparison of published results, and that,
without an understanding of the training set size-accuracy curve behind published
results, conclusions about which approach is "better" should not be drawn from
accuracy scores alone.
Results also suggested that high accuracy scores don't necessarily translate to
high real-world performance.
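To make the class-imbalance point concrete, the minimal sketch below (not taken from the paper) trains a stand-in detector on balanced synthetic data and then evaluates it on a balanced test set versus a benign-heavy one. The synthetic features, the random-forest model, and the 99:1 benign-to-malware ratio are illustrative assumptions only.

```python
# Illustrative sketch: accuracy on a balanced test set vs. metrics on an
# imbalanced, deployment-like test set. Synthetic data and a random forest
# stand in for real PE-file features and a real detector.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# One synthetic pool; class 1 plays the role of "malware".
X, y = make_classification(n_samples=60_000, n_features=30, n_informative=10,
                           weights=[0.5, 0.5], flip_y=0.05, random_state=0)
X_train, X_pool, y_train, y_pool = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Train on balanced data, as is common in published malware-detection studies.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def subsample(X, y, n_benign, n_malware, seed=0):
    """Build a test set with a chosen benign/malware ratio from the held-out pool."""
    rng = np.random.RandomState(seed)
    benign = rng.choice(np.where(y == 0)[0], n_benign, replace=False)
    malware = rng.choice(np.where(y == 1)[0], n_malware, replace=False)
    idx = np.concatenate([benign, malware])
    return X[idx], y[idx]

def report(name, X_test, y_test):
    pred = clf.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, pred):.3f}  "
          f"precision={precision_score(y_test, pred):.3f}  "
          f"recall={recall_score(y_test, pred):.3f}  "
          f"F1={f1_score(y_test, pred):.3f}")

# Balanced test set (as in many published evaluations) vs. a 99:1 benign-heavy
# test set closer to deployment conditions.
report("balanced (50/50)  ", *subsample(X_pool, y_pool, 5_000, 5_000))
report("imbalanced (99/1) ", *subsample(X_pool, y_pool, 9_900, 100))
```

Under such a setup, accuracy typically stays high on both test sets, while precision on the benign-heavy set drops sharply because even a small false-positive rate on the dominant benign class swamps the few true detections. This mirrors the paper's observation that high accuracy scores do not necessarily imply high real-world performance.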
Related papers
- The Impact of Train-Test Leakage on Machine Learning-based Android Malware Detection [6.9053043489744015]
We identify distinct Android apps that have identical or nearly identical app representations.
This will lead to a data leakage problem that inflates a machine learning model's performance.
We propose a leak-aware scheme to construct a machine learning-based Android malware detector.
arXiv Detail & Related papers (2024-10-25T08:04:01Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Performance of Human Annotators in Object Detection and Segmentation of Remotely Sensed Data [0.0]
This study aims to assess the influence of annotation strategies, levels of imbalanced data, and prior experience, on the performance of human annotators.
The experiment is conducted using images with a pixel size of 0.15 m, involving both expert and non-expert participants.
arXiv Detail & Related papers (2024-09-16T13:34:26Z) - Benchmark Transparency: Measuring the Impact of Data on Evaluation [6.307485015636125]
We propose an automated framework that measures the data point distribution across 6 different dimensions.
We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance.
We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric.
arXiv Detail & Related papers (2024-03-31T17:33:43Z) - Benchmarks for Detecting Measurement Tampering [2.9138729302304855]
We build four new text-based datasets to evaluate measurement tampering detection techniques on large language models.
The goal is to determine if examples where all measurements indicate the outcome occurred actually had the outcome occur, or if this was caused by measurement tampering.
We demonstrate techniques that outperform simple baselines on most datasets, but don't achieve maximum performance.
arXiv Detail & Related papers (2023-08-29T19:54:37Z) - Exploring the Effectiveness of Dataset Synthesis: An application of
Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data is slightly underperforming compared to a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z) - Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Data-Centric Machine Learning in the Legal Domain [0.2624902795082451]
This paper explores how changes in a data set influence the measured performance of a model.
Using three publicly available data sets from the legal domain, we investigate how changes to their size, the train/test splits, and the human labelling accuracy impact the performance.
The observed effects are surprisingly pronounced, especially when the per-class performance is considered.
arXiv Detail & Related papers (2022-01-17T23:05:14Z) - DAPPER: Label-Free Performance Estimation after Personalization for
Heterogeneous Mobile Sensing [95.18236298557721]
We present DAPPER (Domain AdaPtation Performance EstimatoR) that estimates the adaptation performance in a target domain with unlabeled target data.
Our evaluation with four real-world sensing datasets compared against six baselines shows that DAPPER outperforms the state-of-the-art baseline by 39.8% in estimation accuracy.
arXiv Detail & Related papers (2021-11-22T08:49:33Z) - Doing Great at Estimating CATE? On the Neglected Assumptions in
Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z) - Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)