Related papers: SCORE: Syntactic Code Representations for Static Script Malware Detection

SCORE: Syntactic Code Representations for Static Script Malware Detection

URL: http://arxiv.org/abs/2411.08182v1
Date: Tue, 12 Nov 2024 20:58:04 GMT
Title: SCORE: Syntactic Code Representations for Static Script Malware Detection
Authors: Ecenaz Erdemir, Kyuhong Park, Michael J. Morais, Vianne R. Gao, Marion Marschalek, Yi Fan,
Abstract summary: Server-side script attacks can steal data, compromise credentials, and disrupt operations. We propose novel feature extraction and deep learning (DL)-based approaches for static script malware detection. Our approach achieves a true positive rate (TPR) up to 81% higher than leading signature-based antivirus solutions.
Score: 9.502104012686491
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As businesses increasingly adopt cloud technologies, they also need to be aware of new security challenges, such as server-side script attacks, to ensure the integrity of their systems and data. These scripts can steal data, compromise credentials, and disrupt operations. Unlike executables with standardized formats (e.g., ELF, PE), scripts are plaintext files with diverse syntax, making them harder to detect using traditional methods. As a result, more sophisticated approaches are needed to protect cloud infrastructures from these evolving threats. In this paper, we propose novel feature extraction and deep learning (DL)-based approaches for static script malware detection, targeting server-side threats. We extract features from plain-text code using two techniques: syntactic code highlighting (SCH) and abstract syntax tree (AST) construction. SCH leverages complex regexes to parse syntactic elements of code, such as keywords, variable names, etc. ASTs generate a hierarchical representation of a program's syntactic structure. We then propose a sequential and a graph-based model that exploits these feature representations to detect script malware. We evaluate our approach on more than 400K server-side scripts in Bash, Python and Perl. We use a balanced dataset of 90K scripts for training, validation, and testing, with the remaining from 400K reserved for further analysis. Experiments show that our method achieves a true positive rate (TPR) up to 81% higher than leading signature-based antivirus solutions, while maintaining a low false positive rate (FPR) of 0.17%. Moreover, our approach outperforms various neural network-based detectors, demonstrating its effectiveness in learning code maliciousness for accurate detection of script malware.

Related papers

Breaking Obfuscation: Cluster-Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection [9.83040332336481]
Malicious JavaScript code poses significant threats to user privacy, system integrity, and enterprise security.<n>We propose DeCoda, a hybrid defense framework that combines large language model (LLM)-based deobfuscation with code graph learning.
arXiv Detail & Related papers (2025-07-30T07:46:49Z)
CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection [2.5228276786940182]
This paper introduces CASTLE, a benchmarking framework for evaluating the vulnerability detection capabilities of different methods. We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs.
arXiv Detail & Related papers (2025-03-12T14:30:05Z)
Secret Breach Prevention in Software Issue Reports [2.8747015994080285]
This paper presents a novel technique for secret breach detection in software issue reports. We highlight the challenges posed by noise, such as log files, URLs, commit IDs, stack traces, and dummy passwords. We propose an approach combining the strengths of state-of-the-artes with the contextual understanding of language models.
arXiv Detail & Related papers (2024-10-31T06:14:17Z)
Towards Novel Malicious Packet Recognition: A Few-Shot Learning Approach [0.0]
Deep Packet Inspection (DPI) has emerged as a key technology in strengthening network security. This study proposes a novel approach that leverages a large language model (LLM) and few-shot learning. Our approach shows promising results with an average accuracy of 86.35% and F1-Score of 86.40% on different malware types.
arXiv Detail & Related papers (2024-09-17T15:02:32Z)
Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes. We find that existing training-based or zero-shot text detectors are ineffective in detecting code. Our method exhibits robustness against revision attacks and generalizes well to Java codes.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning? [5.603751223376071]
We present a practical system that leverages deep learning on a large-scale data set of vulnerable code patterns. We show that in comparison with state of the art vulnerability detection models our approach improves the state of the art by 10%.
arXiv Detail & Related papers (2023-05-23T01:21:55Z)
Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering. Using DIPPER to paraphrase text generated by three large language models (including GPT3.5-davinci-003) successfully evades several detectors, including watermarking. We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
Statement-Level Vulnerability Detection: Learning Vulnerability Patterns Through Information Theory and Contrastive Learning [31.15123852246431]
We propose a novel end-to-end deep learning-based approach to identify the vulnerability-relevant code statements of a specific function. Inspired by the structures observed in real-world vulnerable code, we first leverage mutual information for learning a set of latent variables. We then propose novel clustered spatial contrastive learning in order to further improve the representation learning.
arXiv Detail & Related papers (2022-09-20T00:46:20Z)
VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase for Python [8.810543294798485]
VUDENC is a deep learning-based vulnerability detection tool. It learns features of vulnerable code from a large and real-world Python corpus. VUDENC achieves a recall of 78%-87%, a precision of 82%-96%, and an F1 score of 80%-90%.
arXiv Detail & Related papers (2022-01-20T20:29:22Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora. Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
Adversarial EXEmples: A Survey and Experimental Evaluation of Practical Attacks on Machine Learning for Windows Malware Detection [67.53296659361598]
adversarial EXEmples can bypass machine learning-based detection by perturbing relatively few input bytes. We develop a unifying framework that does not only encompass and generalize previous attacks against machine-learning models, but also includes three novel attacks. These attacks, named Full DOS, Extend and Shift, inject the adversarial payload by respectively manipulating the DOS header, extending it, and shifting the content of the first section.
arXiv Detail & Related papers (2020-08-17T07:16:57Z)
Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics. We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.