Parallel Instance Filtering for Malware Detection
- URL: http://arxiv.org/abs/2206.13889v1
- Date: Tue, 28 Jun 2022 11:14:20 GMT
- Title: Parallel Instance Filtering for Malware Detection
- Authors: Martin Jureček and Olha Jurečková
- Abstract summary: This work presents a new parallel instance selection algorithm called Parallel Instance Filtering (PIF).
The main idea of the algorithm is to split the data set into non-overlapping subsets of instances covering the whole data set and apply a filtering process for each subset.
We compare the PIF algorithm with several state-of-the-art instance selection algorithms on a large data set of 500,000 malicious and benign samples.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning algorithms are widely used in the area of malware detection.
With the growth of sample amounts, training of classification algorithms
becomes more and more expensive. In addition, training data sets may contain
redundant or noisy instances. The problem to be solved is how to select
representative instances from large training data sets without reducing the
accuracy. This work presents a new parallel instance selection algorithm called
Parallel Instance Filtering (PIF). The main idea of the algorithm is to split
the data set into non-overlapping subsets of instances covering the whole data
set and apply a filtering process for each subset. Each subset consists of
instances that have the same nearest enemy. As a result, the PIF algorithm is
fast since subsets are processed independently of each other using parallel
computation. We compare the PIF algorithm with several state-of-the-art
instance selection algorithms on a large data set of 500,000 malicious and
benign samples. The feature set was extracted using static analysis, and it
includes metadata from the portable executable file format. Our experimental
results demonstrate that the proposed instance selection algorithm reduces the
size of a training data set significantly with only a slight decrease in
accuracy. The PIF algorithm outperforms existing instance selection methods
used in the experiments in terms of the ratio between average classification
accuracy and storage percentage.
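The grouping step described in the abstract can be sketched in a few lines. The partition by "nearest enemy" (the closest instance of the opposite class) follows the abstract; the per-subset filtering rule used here (keep only the member closest to the shared enemy) is an illustrative placeholder, not the paper's exact criterion, and a thread pool stands in for whatever parallel backend the authors use.

```python
# Sketch of the Parallel Instance Filtering (PIF) idea: partition the data
# into non-overlapping subsets of instances sharing the same nearest enemy,
# then filter each subset independently in parallel.
from concurrent.futures import ThreadPoolExecutor
from math import dist

def nearest_enemy(X, y):
    """Index of the closest instance with a different label, per instance."""
    return [min((j for j in range(len(X)) if y[j] != yi),
                key=lambda j: dist(xi, X[j]))
            for xi, yi in zip(X, y)]

def filter_subset(idx, X, enemies, keep=1):
    """Placeholder filter: keep the `keep` members nearest the shared enemy."""
    e = enemies[idx[0]]                      # every member shares this enemy
    return sorted(idx, key=lambda i: dist(X[i], X[e]))[:keep]

def pif(X, y, keep=1):
    enemies = nearest_enemy(X, y)
    subsets = {}                             # enemy index -> member indices
    for i, e in enumerate(enemies):
        subsets.setdefault(e, []).append(i)
    with ThreadPoolExecutor() as pool:       # subsets filtered independently
        parts = pool.map(lambda s: filter_subset(s, X, enemies, keep),
                         subsets.values())
    return sorted(i for p in parts for i in p)

# Tiny toy data: two well-separated classes
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
y = [0, 0, 1, 1]
selected = pif(X, y)   # indices of a much smaller representative subset
```

Because the subsets are disjoint by construction, each worker can filter its subset without coordinating with the others, which is what makes the scheme embarrassingly parallel.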
Related papers
- A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions.
We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm.
Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
arXiv Detail & Related papers (2024-07-19T08:29:12Z)
- Data Classification With Multiprocessing [6.513930657238705]
Python multiprocessing is used to test this hypothesis with different classification algorithms.
We conclude that ensembling improves accuracy and multiprocessing reduces execution time for selected algorithms.
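The combination described above (ensembling for accuracy, parallelism for speed) can be illustrated with a minimal sketch. The classifiers here are trivial nearest-centroid models trained on different feature subsets, which is an assumption for illustration only; a thread pool stands in for Python multiprocessing (real speedups for CPU-bound training would need process-based workers).

```python
# Train several weak classifiers in parallel and combine them by majority vote.
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
from math import dist

def train_centroids(args):
    """Fit one nearest-centroid model restricted to the given feature subset."""
    X, y, feats = args
    cents = {}
    for label in set(y):
        pts = [[x[f] for f in feats] for x, l in zip(X, y) if l == label]
        cents[label] = [sum(c) / len(c) for c in zip(*pts)]
    return feats, cents

def predict(models, x):
    """Majority vote over the per-subset nearest-centroid predictions."""
    votes = [min(cents, key=lambda l: dist([x[f] for f in feats], cents[l]))
             for feats, cents in models]
    return Counter(votes).most_common(1)[0][0]

X = [(0, 0, 9), (1, 0, 8), (9, 9, 0), (8, 9, 1)]
y = [0, 0, 1, 1]
subsets = [(0, 1), (1, 2), (0, 2)]           # one feature subset per worker
with ThreadPoolExecutor() as pool:           # models trained concurrently
    models = list(pool.map(train_centroids, [(X, y, f) for f in subsets]))
pred = predict(models, (1, 1, 8))            # classify a new point
```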
arXiv Detail & Related papers (2023-12-23T03:42:13Z)
- FairWASP: Fast and Optimal Fair Wasserstein Pre-processing [9.627848184502783]
We present FairWASP, a novel pre-processing approach to reduce disparities in classification datasets without modifying the original data.
We show theoretically that integer weights are optimal, which means our method can be equivalently understood as duplicating or eliminating samples.
Our work is based on reformulating the pre-processing task as a large-scale mixed-integer program (MIP), for which we propose a highly efficient algorithm based on the cutting plane method.
arXiv Detail & Related papers (2023-10-31T19:36:00Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of improvements to the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
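The over-sampling technique that AutoSMOTE builds on can be sketched as classic SMOTE: synthesize minority samples by interpolating between a minority instance and one of its minority-class neighbours. (AutoSMOTE itself learns these sampling decisions with reinforcement learning; that part is omitted here, so this is a hedged sketch of plain SMOTE, not of AutoSMOTE.)

```python
# Minimal SMOTE-style over-sampling for the minority class.
import random
from math import dist

def smote(minority, n_new, k=2, rng=None):
    """Generate n_new synthetic points by interpolating between a random
    minority instance and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neigh = sorted((p for p in minority if p is not x),
                       key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neigh)
        t = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote(minority, n_new=4)        # 4 synthetic minority points
```

Each synthetic point lies on a segment between two existing minority points, so the method densifies the minority region rather than duplicating samples verbatim.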
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z)
- Optimal Sampling Gaps for Adaptive Submodular Maximization [28.24164217929491]
We study the performance loss caused by probability sampling in the context of adaptive submodular maximization.
We show that policywise submodularity can be found in a wide range of real-world applications.
arXiv Detail & Related papers (2021-04-05T03:21:32Z)
- The Integrity of Machine Learning Algorithms against Software Defect Prediction [0.0]
This report analyses the performance of the Online Sequential Extreme Learning Machine (OS-ELM) proposed by Liang et al.
OS-ELM trains faster than conventional deep neural networks and always converges to the globally optimal solution.
The analysis is carried out on three NASA projects: KC1, PC4, and PC3.
arXiv Detail & Related papers (2020-09-05T17:26:56Z)
- Non-Adaptive Adaptive Sampling on Turnstile Streams [57.619901304728366]
We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume on turnstile streams that use space sublinear in $n$.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model.
arXiv Detail & Related papers (2020-04-23T05:00:21Z)
- LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew [58.21885402826496]
All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets.
We present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity.
We show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets.
arXiv Detail & Related papers (2020-03-06T00:06:20Z)
- Fase-AL -- Adaptation of Fast Adaptive Stacking of Ensembles for Supporting Active Learning [0.0]
This work presents the FASE-AL algorithm which induces classification models with non-labeled instances using Active Learning.
The algorithm achieves promising results in terms of the percentage of correctly classified instances.
arXiv Detail & Related papers (2020-01-30T17:25:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all generated content) and is not responsible for any consequences.