Malware Classification with Word Embedding Features
- URL: http://arxiv.org/abs/2103.02711v1
- Date: Wed, 3 Mar 2021 21:57:11 GMT
- Title: Malware Classification with Word Embedding Features
- Authors: Aparna Sunil Kale and Fabio Di Troia and Mark Stamp
- Abstract summary: Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences.
We implement hybrid machine learning techniques, where we engineer feature vectors by training hidden Markov models.
We conduct substantial experiments over a variety of malware families.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Malware classification is an important and challenging problem in information
security. Modern malware classification techniques rely on machine learning
models that can be trained on features such as opcode sequences, API calls, and
byte $n$-grams, among many others. In this research, we consider opcode
features. We implement hybrid machine learning techniques, where we engineer
feature vectors by training hidden Markov models -- a technique that we refer
to as HMM2Vec -- and Word2Vec embeddings on these opcode sequences. The
resulting HMM2Vec and Word2Vec embedding vectors are then used as features for
classification algorithms. Specifically, we consider support vector machine
(SVM), $k$-nearest neighbor ($k$-NN), random forest (RF), and convolutional
neural network (CNN) classifiers. We conduct substantial experiments over a
variety of malware families. Our experiments extend well beyond any previous
work in this field.
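As a rough illustration of the pipeline the abstract describes (opcode sequence, then fixed-length feature vector, then classifier), the following Python sketch may help. It is not the authors' method: a row-normalized opcode co-occurrence matrix stands in for the HMM2Vec and Word2Vec embeddings, and a plain k-NN majority vote stands in for the classifiers. The vocabulary `VOCAB`, the function names, the family labels, and the toy opcode sequences are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical opcode vocabulary (assumption); real experiments would
# typically keep only the most frequent opcodes seen in the dataset.
VOCAB = ["mov", "push", "call", "add", "jmp"]


def cooccurrence_features(opcodes, window=2):
    """Flatten a row-normalized opcode co-occurrence matrix into one vector.

    This is a simple stand-in for a learned embedding, not Word2Vec or HMM2Vec.
    """
    index = {op: i for i, op in enumerate(VOCAB)}
    n = len(VOCAB)
    counts = [[0.0] * n for _ in range(n)]
    for pos, op in enumerate(opcodes):
        if op not in index:
            continue
        lo = max(0, pos - window)
        hi = min(len(opcodes), pos + window + 1)
        for j in range(lo, hi):
            if j != pos and opcodes[j] in index:
                counts[index[op]][index[opcodes[j]]] += 1.0
    features = []
    for row in counts:
        total = sum(row)
        features.extend(c / total if total else 0.0 for c in row)
    return features


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors nearest to query.

    train is a list of (vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy opcode sequences for two made-up families (assumption, not real data).
samples = [
    (["mov", "push", "call"] * 4, "family_A"),
    (["push", "mov", "call", "mov"] * 3, "family_A"),
    (["add", "jmp"] * 6, "family_B"),
    (["jmp", "add", "add"] * 4, "family_B"),
]
train = [(cooccurrence_features(seq), label) for seq, label in samples]

query = cooccurrence_features(["add", "jmp", "add", "add", "jmp", "add"])
print(knn_predict(train, query, k=3))
```

In this sketch every sample maps to a vector of the same length (one row per vocabulary opcode), which is what allows off-the-shelf classifiers to be applied regardless of how long the original opcode sequence was.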
Related papers
- Toward Multi-class Anomaly Detection: Exploring Class-aware Unified Model against Inter-class Interference [67.36605226797887]
We introduce a Multi-class Implicit Neural representation Transformer for unified Anomaly Detection (MINT-AD).
By learning the multi-class distributions, the model generates class-aware query embeddings for the transformer decoder.
MINT-AD can project category and position information into a feature embedding space, further supervised by classification and prior probability loss functions.
arXiv Detail & Related papers (2024-03-21T08:08:31Z)
- Enhancing Malware Detection by Integrating Machine Learning with Cuckoo Sandbox [0.0]
This study aims to classify and identify malware extracted from a dataset containing API call sequences.
Both deep learning and machine learning algorithms achieve remarkably high levels of accuracy, reaching up to 99% in certain cases.
arXiv Detail & Related papers (2023-11-07T22:33:17Z)
- Online Clustered Codebook [100.1650001618827]
We present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE).
Our approach selects encoded features as anchors to update the "dead" codevectors, while optimising the codebooks which are alive via the original loss.
Our CVQ-VAE can be easily integrated into the existing models with just a few lines of code.
arXiv Detail & Related papers (2023-07-27T18:31:04Z)
- A Natural Language Processing Approach to Malware Classification [2.707154152696381]
In this research, we consider a hybrid architecture, where Hidden Markov Models (HMM) are trained on opcode sequences.
Extracting the HMM hidden state sequences can be viewed as a form of feature engineering.
We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset.
arXiv Detail & Related papers (2023-07-07T23:16:23Z)
- Class-Incremental Learning: A Survey [84.30083092434938]
Class-Incremental Learning (CIL) enables the learner to incorporate the knowledge of new classes incrementally.
CIL models tend to catastrophically forget the characteristics of former classes, and their performance degrades drastically.
We provide a rigorous and unified evaluation of 17 methods in benchmark image classification tasks to find out the characteristics of different algorithms.
arXiv Detail & Related papers (2023-02-07T17:59:05Z)
- Quality-Aware Decoding for Neural Machine Translation [64.24934199944875]
We propose quality-aware decoding for neural machine translation (NMT).
We leverage recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods.
We find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics and to human assessments.
arXiv Detail & Related papers (2022-05-02T15:26:28Z)
- Rethinking Nearest Neighbors for Visual Classification [56.00783095670361]
k-NN is a lazy learning method that aggregates the distance between the test image and top-k neighbors in a training set.
We adopt k-NN with pre-trained visual representations produced by either supervised or self-supervised methods in two steps.
Via extensive experiments on a wide range of classification tasks, our study reveals the generality and flexibility of k-NN integration.
arXiv Detail & Related papers (2021-12-15T20:15:01Z)
- HyperSeed: Unsupervised Learning with Vector Symbolic Architectures [5.258404928739212]
This paper presents a novel unsupervised machine learning approach named Hyperseed.
It leverages Vector Symbolic Architectures (VSA) to quickly learn a topology-preserving feature map of unlabelled data.
The two distinctive novelties of the Hyperseed algorithm are 1) learning from only a few input data samples and 2) a learning rule based on a single vector operation.
arXiv Detail & Related papers (2021-10-15T20:05:43Z)
- CNN vs ELM for Image-Based Malware Classification [3.4806267677524896]
We train and evaluate machine learning models for malware classification, based on features that can be obtained without disassembly or execution of code.
We find that ELMs can achieve accuracies on par with CNNs, yet ELM training requires less than 2% of the time needed to train a comparable CNN.
arXiv Detail & Related papers (2021-03-24T00:51:06Z)
- A Comparison of Word2Vec, HMM2Vec, and PCA2Vec for Malware Classification [3.0969191504482247]
We first consider multiple different word embedding techniques within the context of malware classification.
We derive feature embeddings based on opcode sequences for malware samples from a variety of different families.
We show that we can obtain better classification accuracy based on these feature embeddings.
arXiv Detail & Related papers (2021-03-07T14:41:18Z)
- Many-Class Few-Shot Learning on Multi-Granularity Class Hierarchy [57.68486382473194]
We study the many-class few-shot (MCFS) problem in both supervised learning and meta-learning settings.
In this paper, we leverage the class hierarchy as prior knowledge to train a coarse-to-fine classifier.
The model, "memory-augmented hierarchical-classification network (MahiNet)", performs coarse-to-fine classification where each coarse class can cover multiple fine classes.
arXiv Detail & Related papers (2020-06-28T01:11:34Z)