Can We Generate Shellcodes via Natural Language? An Empirical Study
- URL: http://arxiv.org/abs/2202.03755v1
- Date: Tue, 8 Feb 2022 09:57:34 GMT
- Title: Can We Generate Shellcodes via Natural Language? An Empirical Study
- Authors: Pietro Liguori, Erfan Al-Hossami, Domenico Cotroneo, Roberto Natella,
Bojan Cukic, Samira Shaikh
- Abstract summary: We propose an approach based on Neural Machine Translation to automatically generate shellcodes.
Shellcode_IA32 consists of 3,200 assembly code snippets of real Linux/x86 shellcodes annotated using natural language.
We show that NMT can generate assembly code snippets from the natural language with high accuracy and that in many cases can generate entire shellcodes with no errors.
- Score: 4.82810058837951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Writing software exploits is an important practice for offensive security
analysts to investigate and prevent attacks. In particular, shellcodes are
especially time-consuming and a technical challenge, as they are written in
assembly language. In this work, we address the task of automatically
generating shellcodes, starting purely from descriptions in natural language,
by proposing an approach based on Neural Machine Translation (NMT). We then
present an empirical study using a novel dataset (Shellcode_IA32), which
consists of 3,200 assembly code snippets of real Linux/x86 shellcodes from
public databases, annotated using natural language. Moreover, we propose novel
metrics to evaluate the accuracy of NMT at generating shellcodes. The empirical
analysis shows that NMT can generate assembly code snippets from the natural
language with high accuracy and that in many cases can generate entire
shellcodes with no errors.
Related papers
- NoviCode: Generating Programs from Natural Language Utterances by Novices [59.71218039095155]
We present NoviCode, a novel NL Programming task which takes as input an API and a natural language description by a novice non-programmer.
We show that NoviCode is indeed a challenging task in the code synthesis domain, and that generating complex code from non-technical instructions goes beyond the current Text-to-Code paradigm.
arXiv Detail & Related papers (2024-07-15T11:26:03Z) - Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce emphsynthetic programming elicitation and compilation (SPEAC)
SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to fill the gap between programming languages and natural language.
Various experiments and ablations are done on four datasets including both the C++ and python languages to validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z) - Synchromesh: Reliable code generation from pre-trained language models [38.15391794443022]
We propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation.
First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection.
Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD), a general framework for constraining the output to a set of valid programs in the target language.
arXiv Detail & Related papers (2022-01-26T22:57:44Z) - Shellcode_IA32: A Dataset for Automatic Shellcode Generation [2.609784101826762]
We take the first step to address the task of automatically generating shellcodes, i.e., small pieces of code used as a payload in the exploitation of a software vulnerability.
We assemble and release a novel dataset (Shellcode_IA32), consisting of challenging but common assembly instructions with their natural language descriptions.
arXiv Detail & Related papers (2021-04-27T10:50:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.