A Natural Language Processing Approach to Malware Classification
- URL: http://arxiv.org/abs/2307.11032v1
- Date: Fri, 7 Jul 2023 23:16:23 GMT
- Title: A Natural Language Processing Approach to Malware Classification
- Authors: Ritik Mehta, Olha Jurečková, and Mark Stamp
- Abstract summary: In this research, we consider a hybrid architecture in which Hidden Markov Models (HMMs) are trained on opcode sequences. Extracting the HMM hidden state sequences can be viewed as a form of feature engineering.
We find that this NLP-based approach outperforms other popular techniques on a challenging malware dataset.
- Score: 2.707154152696381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many different machine learning and deep learning techniques have been
successfully employed for malware detection and classification. Examples of
popular learning techniques in the malware domain include Hidden Markov Models
(HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector
Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term
Memory (LSTM) networks. In this research, we consider a hybrid architecture,
where HMMs are trained on opcode sequences, and the resulting hidden states of
these trained HMMs are used as feature vectors in various classifiers. In this
context, extracting the HMM hidden state sequences can be viewed as a form of
feature engineering that is somewhat analogous to techniques that are commonly
employed in Natural Language Processing (NLP). We find that this NLP-based
approach outperforms other popular techniques on a challenging malware dataset,
with an HMM-Random Forest model yielding the best results.
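To make the architecture concrete, here is a minimal sketch of such a hybrid HMM-Random Forest pipeline, assuming hmmlearn and scikit-learn. The number of hidden states, the opcode vocabulary size, the synthetic data, and the hidden-state frequency featurization are all illustrative assumptions; the paper feeds its own encoding of the hidden state sequences to several classifiers.

```python
# Minimal sketch of the hybrid pipeline: an HMM trained on opcode sequences,
# with decoded hidden-state sequences turned into classifier features.
# All hyperparameters and the synthetic data are illustrative assumptions.
import numpy as np
from hmmlearn import hmm  # hmmlearn >= 0.3 provides CategoricalHMM
from sklearn.ensemble import RandomForestClassifier

N_STATES = 2    # number of HMM hidden states (assumption)
N_SYMBOLS = 30  # size of the opcode vocabulary (assumption)

def train_opcode_hmm(sequences):
    """Fit a discrete-observation HMM on concatenated opcode-index sequences."""
    X = np.concatenate(sequences).reshape(-1, 1)
    lengths = [len(s) for s in sequences]
    model = hmm.CategoricalHMM(n_components=N_STATES, n_iter=100, random_state=0)
    model.fit(X, lengths)
    return model

def hidden_state_features(model, seq):
    """Viterbi-decode one sequence; return hidden-state frequencies as features."""
    states = model.predict(np.asarray(seq).reshape(-1, 1))
    return np.bincount(states, minlength=N_STATES) / len(states)

# Synthetic stand-in for opcode-index sequences from two malware families.
rng = np.random.default_rng(0)
sequences = [rng.integers(0, N_SYMBOLS, size=200) for _ in range(20)]
labels = [i % 2 for i in range(20)]

model = train_opcode_hmm(sequences)
feats = np.vstack([hidden_state_features(model, s) for s in sequences])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(feats, labels)
```

Viterbi decoding turns each opcode sequence into a hidden-state sequence; summarizing it as a frequency vector is one simple way to obtain fixed-length inputs for the Random Forest.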
Related papers
- Enhancing Malware Detection by Integrating Machine Learning with Cuckoo Sandbox [0.0]
This study aims to classify and identify malware extracted from a dataset containing API call sequences.
Both deep learning and machine learning algorithms achieve remarkably high levels of accuracy, reaching up to 99% in certain cases.
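As a hedged illustration of one common formulation of this task, API call traces can be treated as token sequences and reduced to bag-of-n-gram features for a classifier. The API names, labels, and model choice below are assumptions, not the study's exact setup.

```python
# Illustrative only: classify API call traces via n-gram counts + Random Forest.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Each sample is one space-joined API call trace (hypothetical examples).
traces = [
    "NtOpenFile NtReadFile NtWriteFile NtClose",
    "RegOpenKeyEx RegSetValueEx RegCloseKey",
]
labels = [0, 1]  # hypothetical benign/malicious labels

pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), token_pattern=r"\S+"),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pipeline.fit(traces, labels)
```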
arXiv Detail & Related papers (2023-11-07T22:33:17Z)
- Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By introducing learnable memory tokens with an attention mechanism, we can effectively boost performance without huge computational overhead.
We show our approach on various image and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
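As a generic sketch of the memory-token idea (not the paper's specific heterogeneous design), learnable tokens can be concatenated with the input so attention reads from them as extra keys and values; all sizes below are assumptions.

```python
import torch
import torch.nn as nn

class MemoryAugmentedLayer(nn.Module):
    """Generic sketch: learnable memory tokens joined to the input as extra
    keys/values for attention. Sizes are illustrative assumptions."""
    def __init__(self, d_model=64, n_memory=8, n_heads=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(n_memory, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        kv = torch.cat([mem, x], dim=1)  # queries attend over memory + input
        out, _ = self.attn(x, kv, kv)
        return out

layer = MemoryAugmentedLayer()
y = layer(torch.randn(2, 10, 64))  # toy usage
```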
arXiv Detail & Related papers (2023-10-17T01:05:28Z)
- How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z)
- Intelligence Processing Units Accelerate Neuromorphic Learning [52.952192990802345]
Spiking neural networks (SNNs) have achieved orders-of-magnitude improvements in energy consumption and latency.
We present an IPU-optimized release of our custom SNN Python package, snnTorch.
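For orientation, here is a minimal CPU-side snnTorch usage sketch: a single leaky integrate-and-fire neuron stepped through time. The hyperparameters are illustrative, and the IPU-optimized release described in the paper is not exercised here.

```python
import torch
import snntorch as snn

lif = snn.Leaky(beta=0.9)         # leaky integrate-and-fire neuron
mem = lif.init_leaky()            # initialize the membrane potential
inputs = 0.5 * torch.rand(20, 1)  # hypothetical input current, 20 time steps

spikes = []
for step in range(inputs.size(0)):
    spk, mem = lif(inputs[step], mem)  # one simulation time step
    spikes.append(spk)
```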
arXiv Detail & Related papers (2022-11-19T15:44:08Z)
- Concurrent Neural Tree and Data Preprocessing AutoML for Image Classification [0.5735035463793008]
Current state-of-the-art (SOTA) methods do not include traditional methods for manipulating input data as part of the algorithmic search space.
We adapt the Evolutionary Multi-objective Algorithm Design Engine (EMADE), a multi-objective evolutionary search framework for traditional machine learning methods, to perform neural architecture search.
We show that including these methods in the search space has the potential to improve performance on the CIFAR-10 image classification benchmark.
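The sketch below is a generic illustration of that idea, not EMADE itself: preprocessing choices enter the same search space as the model, and candidates are scored jointly. A tiny random search stands in for EMADE's evolutionary, multi-objective search, and the dataset and options are assumptions.

```python
# Generic illustration (not EMADE): search jointly over preprocessing + model.
import random
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_digits(return_X_y=True)
preprocessors = [None, StandardScaler(), MinMaxScaler()]

best_score, best_pipe = -1.0, None
random.seed(0)
for _ in range(6):  # tiny random search over the joint space
    prep = random.choice(preprocessors)
    n_trees = random.choice([50, 100, 200])
    steps = ([prep] if prep is not None else []) + [
        RandomForestClassifier(n_estimators=n_trees, random_state=0)
    ]
    pipe = make_pipeline(*steps)
    score = cross_val_score(pipe, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_pipe = score, pipe
```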
arXiv Detail & Related papers (2022-05-25T20:03:09Z)
- Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers [67.688697838109]
This paper presents a novel method to train quantized RNNLMs from scratch using alternating direction methods of multipliers (ADMM).
Experiments on two tasks suggest the proposed ADMM quantization achieved a model size compression factor of up to 31 times over the full precision baseline RNNLMs.
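A hedged numpy sketch of the ADMM quantization loop follows: alternate a penalized weight update, projection onto a low-bit grid, and a dual update. The quantizer, learning rates, and dummy gradient are assumptions; the paper applies this scheme inside full RNNLM training rather than to a standalone weight vector.

```python
import numpy as np

def quantize(w, bits=2):
    """Project weights onto a symmetric low-bit grid (illustrative quantizer)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 2 bits -> grid points {-1, 0, +1}
    scale = np.max(np.abs(w)) / max(levels, 1) + 1e-12
    return np.clip(np.round(w / scale), -levels, levels) * scale

def admm_quantize(w, grad_fn, steps=50, rho=1e-3, lr=1e-2):
    """Skeleton ADMM loop; grad_fn(w) returns the task-loss gradient."""
    u = np.zeros_like(w)  # scaled dual variable
    q = quantize(w)       # quantized copy of the weights
    for _ in range(steps):
        # 1) primal step: task gradient plus ADMM penalty rho * (w - q + u)
        w = w - lr * (grad_fn(w) + rho * (w - q + u))
        # 2) projection step: re-quantize the shifted weights
        q = quantize(w + u)
        # 3) dual update
        u = u + w - q
    return q

# Toy usage with a dummy quadratic loss 0.5*||w||^2, so grad = w (assumption).
w0 = np.random.default_rng(0).normal(size=16)
wq = admm_quantize(w0, grad_fn=lambda w: w)
```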
arXiv Detail & Related papers (2021-11-29T09:30:06Z)
- Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z)
- Malware Classification with Word Embedding Features [6.961253535504979]
Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences.
We implement hybrid machine learning techniques, where we engineer feature vectors by training hidden Markov models.
We conduct substantial experiments over a variety of malware families.
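A minimal sketch of one such word-embedding featurization is shown below, assuming gensim's Word2Vec over opcode mnemonics; the toy sequences, hyperparameters, and mean-pooling step are illustrative, and the paper also engineers features from trained HMMs.

```python
import numpy as np
from gensim.models import Word2Vec

# Hypothetical opcode-mnemonic sequences, one list per sample.
samples = [
    ["mov", "add", "jmp", "mov", "cmp"],
    ["push", "pop", "call", "ret", "mov"],
]

# Train opcode embeddings; vector_size/window/min_count are assumptions.
w2v = Word2Vec(sentences=samples, vector_size=16, window=3, min_count=1, seed=0)

# One simple per-sample feature vector: the mean of its opcode embeddings.
feats = np.vstack(
    [np.mean([w2v.wv[op] for op in seq], axis=0) for seq in samples]
)
```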
arXiv Detail & Related papers (2021-03-03T21:57:11Z)
- Introducing the Hidden Neural Markov Chain framework [7.85426761612795]
This paper proposes the original Hidden Neural Markov Chain (HNMC) framework, a new family of sequential neural models.
We propose three different models: the classic HNMC, the HNMC2, and the HNMC-CN.
These experiments show the potential of this new neural sequential framework, which may open the way to new models and could eventually compete with the prevalent BiLSTM and BiGRU.
arXiv Detail & Related papers (2021-02-17T20:13:45Z)
- A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning [32.59760685342343]
Probabilistic Latent Variable Models provide an alternative to self-supervised learning approaches for linguistic representation learning from speech.
In this work, we propose ConvDMM, a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks.
When trained on a large scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extracting methods.
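For intuition, a bare-bones PyTorch sketch of a deep Markov model's generative pass follows: Gaussian latents with neural transition and emission functions. It omits ConvDMM's convolutional encoder and inference network, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyDMM(nn.Module):
    """Bare-bones deep Markov model generative pass (illustrative sizes)."""
    def __init__(self, z_dim=8, x_dim=40):
        super().__init__()
        self.z_dim = z_dim
        # Transition net outputs mean and log-variance of the next latent.
        self.trans = nn.Sequential(
            nn.Linear(z_dim, 32), nn.Tanh(), nn.Linear(32, 2 * z_dim)
        )
        # Emission net maps a latent to the observation mean (e.g. features).
        self.emit = nn.Sequential(
            nn.Linear(z_dim, 32), nn.Tanh(), nn.Linear(32, x_dim)
        )

    def forward(self, T=10):
        z = torch.zeros(1, self.z_dim)
        xs = []
        for _ in range(T):
            mu, logvar = self.trans(z).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample z_t
            xs.append(self.emit(z))  # observation mean at step t
        return torch.stack(xs, dim=1)  # (1, T, x_dim)

x = TinyDMM()(T=5)  # toy generative rollout
```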
arXiv Detail & Related papers (2020-06-03T21:50:20Z)
- The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding [97.85957811603251]
We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models.
Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks.
A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm.
arXiv Detail & Related papers (2020-02-19T03:05:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.