A Comparison of Word2Vec, HMM2Vec, and PCA2Vec for Malware Classification
- URL: http://arxiv.org/abs/2103.05763v1
- Date: Sun, 7 Mar 2021 14:41:18 GMT
- Title: A Comparison of Word2Vec, HMM2Vec, and PCA2Vec for Malware Classification
- Authors: Aniket Chandak and Wendy Lee and Mark Stamp
- Abstract summary: We first consider multiple different word embedding techniques within the context of malware classification.
We derive feature embeddings based on opcode sequences for malware samples from a variety of different families.
We show that we can obtain better classification accuracy based on these feature embeddings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word embeddings are often used in natural language processing as a means to
quantify relationships between words. More generally, these same word embedding
techniques can be used to quantify relationships between features. In this
paper, we first consider multiple different word embedding techniques within
the context of malware classification. We use hidden Markov models to obtain
embedding vectors in an approach that we refer to as HMM2Vec, and we generate
vector embeddings based on principal component analysis. We also consider the
popular neural network based word embedding technique known as Word2Vec. In
each case, we derive feature embeddings based on opcode sequences for malware
samples from a variety of different families. We show that we can obtain better
classification accuracy based on these feature embeddings than with baseline
HMM experiments that use the opcode sequences directly. These results show
that word embeddings can be a useful feature
engineering step in the field of malware analysis.
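The PCA-based embedding described in the abstract can be illustrated with a short, purely illustrative sketch: build an opcode co-occurrence matrix from the sequences, center it, and project each opcode onto the first principal component, so that opcodes used in similar contexts land near each other. The vocabulary, sequences, and helper functions below are hypothetical stand-ins; the paper trains on real opcode sequences from malware families and uses higher-dimensional embeddings.

```python
# A minimal, self-contained sketch of a PCA2Vec-style opcode embedding,
# using a toy vocabulary and synthetic sequences (assumptions, not the
# paper's actual pipeline or data).

def cooccurrence(seqs, vocab, window=2):
    """Symmetric opcode co-occurrence counts within a sliding window."""
    idx = {op: i for i, op in enumerate(vocab)}
    n = len(vocab)
    M = [[0.0] * n for _ in range(n)]
    for seq in seqs:
        for i, op in enumerate(seq):
            for j in range(max(0, i - window), min(len(seq), i + window + 1)):
                if j != i:
                    M[idx[op]][idx[seq[j]]] += 1.0
    return M

def leading_eigenvector(A, iters=200):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    n = len(A)
    v = [float(i + 1) for i in range(n)]  # non-uniform start vector
    for _ in range(iters):
        w = [sum(A[r][k] * v[k] for k in range(n)) for r in range(n)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]
    return v

# Hypothetical opcode sequences standing in for two malware families.
vocab = ["MOV", "ADD", "PUSH", "POP", "CALL"]
seqs = [["MOV", "ADD", "MOV", "ADD", "CALL"],
        ["PUSH", "POP", "PUSH", "POP", "CALL"]]

M = cooccurrence(seqs, vocab)
n = len(vocab)
# Center each column, as PCA requires.
means = [sum(M[r][c] for r in range(n)) / n for c in range(n)]
C = [[M[r][c] - means[c] for c in range(n)] for r in range(n)]
# The dominant eigenvector of the Gram matrix C C^T gives, up to scale and
# sign, each opcode's projection onto the first principal component.
gram = [[sum(C[a][k] * C[b][k] for k in range(n)) for b in range(n)]
        for a in range(n)]
scores = leading_eigenvector(gram)
embed = {op: scores[i] for i, op in enumerate(vocab)}
```

On this toy data, MOV and ADD (which share contexts) receive nearly identical one-dimensional embeddings, PUSH and POP sit on the opposite side, and CALL, which co-occurs equally with both groups, lands near zero. In practice one would keep several principal components per opcode and feed the resulting vectors to a downstream classifier.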
Related papers
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Multi hash embeddings in spaCy [1.6790532021482656]
spaCy is a natural language processing library that generates multi-embedding representations of words.
The default embedding layer in spaCy is a hash embeddings layer.
In this technical report we lay out a bit of history and introduce the embedding methods in spaCy in detail.
arXiv Detail & Related papers (2022-12-19T06:03:04Z)
- Malware Classification with Word Embedding Features [6.961253535504979]
Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences.
We implement hybrid machine learning techniques, where we engineer feature vectors by training hidden Markov models.
We conduct substantial experiments over a variety of malware families.
arXiv Detail & Related papers (2021-03-03T21:57:11Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Integration of Domain Knowledge using Medical Knowledge Graph Deep Learning for Cancer Phenotyping [6.077023952306772]
We propose a method to integrate external knowledge from medical terminology into the context captured by word embeddings.
We evaluate the proposed approach using a Multitask Convolutional Neural Network (MT-CNN) to extract six cancer characteristics from a dataset of 900K cancer pathology reports.
arXiv Detail & Related papers (2021-01-05T03:59:43Z)
- R$^2$-Net: Relation of Relation Learning Network for Sentence Semantic Matching [58.72111690643359]
We propose a Relation of Relation Learning Network (R2-Net) for sentence semantic matching.
We first employ BERT to encode the input sentences from a global perspective.
Then a CNN-based encoder is designed to capture keywords and phrase information from a local perspective.
To fully leverage labels for better relation information extraction, we introduce a self-supervised relation of relation classification task.
arXiv Detail & Related papers (2020-12-16T13:11:30Z)
- SChME at SemEval-2020 Task 1: A Model Ensemble for Detecting Lexical Semantic Change [58.87961226278285]
This paper describes SChME, a method used in SemEval-2020 Task 1 on unsupervised detection of lexical semantic change.
SChME uses a model ensemble combining signals from distributional models (word embeddings) and word frequency models, where each model casts a vote indicating the probability that a word suffered semantic change according to that feature.
arXiv Detail & Related papers (2020-12-02T23:56:34Z)
- Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model [1.2691047660244335]
We analyze the word2vec model for learning word vectors and present the most effective word embedding for the Bangla language.
We cluster the word vectors to examine the relational similarity of words for intrinsic evaluation, and use different word embeddings as features of news articles for extrinsic evaluation.
arXiv Detail & Related papers (2020-10-26T08:00:48Z)
- On the Learnability of Concepts: With Applications to Comparing Word Embedding Algorithms [0.0]
We introduce the notion of "concept" as a list of words that have shared semantic content.
We first use this notion to measure the learnability of concepts on pretrained word embeddings.
We then develop a statistical analysis of concept learnability, based on hypothesis testing and ROC curves, in order to compare the relative merits of various embedding algorithms.
arXiv Detail & Related papers (2020-06-17T14:25:36Z)
- A Comparative Study of Lexical Substitution Approaches based on Neural Language Models [117.96628873753123]
We present a large-scale comparative study of popular neural language and masked language models.
We show that the already competitive results achieved by SOTA LMs/MLMs can be further improved if information about the target word is injected properly.
arXiv Detail & Related papers (2020-05-29T18:43:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.