Demystifying Feature Engineering in Malware Analysis of API Call Sequences
- URL: http://arxiv.org/abs/2512.01666v1
- Date: Mon, 01 Dec 2025 13:36:42 GMT
- Title: Demystifying Feature Engineering in Malware Analysis of API Call Sequences
- Authors: Tianheng Qu, Hongsong Zhu, Limin Sun, Haining Wang, Haiqiang Fei, Zheng He, Zhi Li
- Abstract summary: Machine learning (ML) has been widely used to analyze API call sequences in malware analysis. Traditional feature extraction is based on human domain knowledge. There is a trend of using natural language processing (NLP) for automatic feature extraction.
- Score: 12.196708313633522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) has been widely used to analyze API call sequences in malware analysis, a task that typically requires domain specialists to extract relevant features from raw data. The extracted features play a critical role in malware analysis. Traditional feature extraction relies on human domain knowledge, while there is a growing trend of using natural language processing (NLP) for automatic feature extraction. This raises a question: how do we effectively select features for malware analysis based on API call sequences? To answer it, this paper presents a comprehensive study investigating the impact of feature engineering on malware classification. We first conducted a comparative performance evaluation of three models, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Transformer, under knowledge-based and NLP-based feature engineering methods. We observed that models with knowledge-based features generally outperform those with NLP-based features across all metrics, especially at smaller sample sizes. We then analyzed the complete set of data features available from API call sequences; this analysis reveals that models often focus on features such as handles and virtual addresses, which vary across executions and are difficult for human analysts to interpret.
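To make the two feature-engineering routes concrete, below is a minimal sketch in Python of how knowledge-based and NLP-based features might be derived from the same raw API call trace. The category map, the specific API names, and the TF-IDF configuration are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch contrasting knowledge-based and NLP-based feature
# engineering on an API call sequence. The category map and API names
# are hypothetical, not the paper's exact pipeline.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy API call trace as produced by a sandbox run.
trace = ["NtOpenFile", "NtReadFile", "NtAllocateVirtualMemory",
         "NtWriteVirtualMemory", "NtCreateThreadEx", "NtClose"]

# Knowledge-based: a human-curated mapping from API calls to behavior
# categories, counted into a short, fixed-length, interpretable vector.
CATEGORY = {  # hypothetical expert-defined categories
    "NtOpenFile": "file", "NtReadFile": "file", "NtClose": "file",
    "NtAllocateVirtualMemory": "memory", "NtWriteVirtualMemory": "memory",
    "NtCreateThreadEx": "process",
}
counts = Counter(CATEGORY.get(api, "other") for api in trace)
knowledge_features = [counts[c] for c in ("file", "memory", "process", "other")]

# NLP-based: treat the trace as a "sentence" of API tokens and let
# TF-IDF over call bigrams discover features automatically.
vectorizer = TfidfVectorizer(ngram_range=(2, 2), lowercase=False,
                             tokenizer=str.split, token_pattern=None)
nlp_features = vectorizer.fit_transform([" ".join(trace)])

print(knowledge_features)   # -> [3, 2, 1, 0]
print(nlp_features.shape)   # (1, number of distinct API bigrams)
```

The knowledge-based vector is compact, stable across runs, and human-readable, while the NLP-based vocabulary is learned from data; the abstract's finding is that the former tends to win, especially with small training sets.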
Related papers
- Step-Level Sparse Autoencoder for Reasoning Process Interpretation [48.99201531966593]
Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. We propose the step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features.
arXiv Detail & Related papers (2026-03-03T14:25:02Z)
- Understanding Generative AI Content with Embedding Models [4.662332573448995]
We show that deep neural networks (DNNs) implicitly engineer features by transforming their input data into hidden feature vectors called embeddings. We find empirical evidence that there is intrinsic separability between real samples and those generated by artificial intelligence (AI).
arXiv Detail & Related papers (2024-08-19T22:07:05Z)
- Notes on Applicability of Explainable AI Methods to Machine Learning Models Using Features Extracted by Persistent Homology [0.0]
Persistent homology (PH) has found wide-ranging applications in machine learning.
Relatively simple downstream machine learning models achieve satisfactory accuracy when processing these extracted features, which underlines the pipeline's superior interpretability.
We explore the potential application of explainable AI methodologies to this PH-ML pipeline.
arXiv Detail & Related papers (2023-10-15T08:56:15Z)
- Nebula: Self-Attention for Dynamic Malware Analysis [14.710331873072146]
We introduce Nebula, a versatile, self-attention Transformer-based neural architecture that generalizes across different behavioral representations and formats.
We perform experiments on both malware detection and classification tasks, using three datasets acquired from different dynamic analysis platforms.
We showcase how self-supervised pre-training matches the performance of fully supervised models with only 20% of the training data. A minimal sketch of such a self-attention encoder over API call tokens appears after this list.
arXiv Detail & Related papers (2023-09-19T09:24:36Z)
- PyRCA: A Library for Metric-based Root Cause Analysis [66.72542200701807]
PyRCA is an open-source machine learning library for Root Cause Analysis (RCA) in Artificial Intelligence for IT Operations (AIOps).
It provides a holistic framework to uncover the complicated metric causal dependencies and automatically locate root causes of incidents.
arXiv Detail & Related papers (2023-06-20T09:55:10Z)
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis [128.0532113800092]
We present a mechanistic interpretation of Transformer-based LMs on arithmetic questions.
This provides insights into how information related to arithmetic is processed by LMs.
arXiv Detail & Related papers (2023-05-24T11:43:47Z)
- Metric Tools for Sensitivity Analysis with Applications to Neural Networks [0.0]
Explainable Artificial Intelligence (XAI) aims to provide interpretations for predictions made by Machine Learning models.
In this paper, a theoretical framework is proposed to study sensitivities of ML models using metric techniques.
A complete family of new quantitative metrics called $\alpha$-curves is extracted.
arXiv Detail & Related papers (2023-05-03T18:10:21Z)
- Generalization Properties of Retrieval-based Models [50.35325326050263]
Retrieval-based machine learning methods have enjoyed success on a wide range of problems.
Despite growing literature showcasing the promise of these models, the theoretical underpinning for such models remains underexplored.
We present a formal treatment of retrieval-based models to characterize their generalization ability.
arXiv Detail & Related papers (2022-10-06T00:33:01Z)
- Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z)
- Comparative Code Structure Analysis using Deep Learning for Performance Prediction [18.226950022938954]
This paper aims to assess the feasibility of using purely static information (e.g., abstract syntax tree or AST) of applications to predict performance change based on the change in code structure.
Our evaluations of several deep embedding learning methods demonstrate that tree-based Long Short-Term Memory (LSTM) models can leverage the hierarchical structure of source code to discover latent representations, achieving up to 84% (individual problem) and 73% (combined dataset with multiple problems) accuracy in predicting the change in performance.
arXiv Detail & Related papers (2021-02-12T16:59:12Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
- Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)
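As referenced from the Nebula entry above, here is a minimal, hypothetical sketch of a self-attention (Transformer) encoder over API call tokens, of the kind evaluated both there and as the main paper's Transformer model. The vocabulary size, dimensions, layer counts, and mean pooling are illustrative assumptions, not either paper's exact architecture.

```python
# Minimal sketch (assumed hyperparameters) of a Transformer encoder that
# classifies a sequence of API call token ids as benign or malicious.
import torch
import torch.nn as nn

class ApiSeqTransformer(nn.Module):
    def __init__(self, vocab_size=1024, d_model=128, n_heads=4,
                 n_layers=2, n_classes=2, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # API-name tokens
        self.pos = nn.Embedding(max_len, d_model)      # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, ids):                            # ids: (B, T)
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.encoder(self.tok(ids) + self.pos(pos))  # (B, T, d)
        return self.head(h.mean(dim=1))                # mean-pool -> logits

# Toy usage: a batch of two padded traces of 16 API-token ids.
model = ApiSeqTransformer()
logits = model(torch.randint(0, 1024, (2, 16)))
print(logits.shape)  # torch.Size([2, 2])
```

Inspecting the attention weights of such an encoder is one way to see which calls and arguments the model relies on, which connects to the main paper's observation that models often latch onto hard-to-interpret features like handles and virtual addresses.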
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.