Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary
- URL: http://arxiv.org/abs/2306.02546v4
- Date: Mon, 09 Dec 2024 06:45:17 GMT
- Title: Symbol Preference Aware Generative Models for Recovering Variable Names from Stripped Binary
- Authors: Xiangzhe Xu, Zhuo Zhang, Zian Su, Ziyang Huang, Shiwei Feng, Yapeng Ye, Nan Jiang, Danning Xie, Siyuan Cheng, Lin Tan, Xiangyu Zhang
- Abstract summary: A prominent challenge in decompilation is to recover variable names. We propose a novel technique that leverages the strengths of generative models while mitigating model biases. We build a prototype, GenNm, from pre-trained generative models CodeGemma-2B, CodeLlama-7B, and CodeLlama-34B.
- Score: 18.05110624825475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decompilation aims to recover the source code form of a binary executable. It has many security applications, such as malware analysis, vulnerability detection, and code hardening. A prominent challenge in decompilation is to recover variable names. We propose a novel technique that leverages the strengths of generative models while mitigating model biases. We build a prototype, GenNm, from pre-trained generative models CodeGemma-2B, CodeLlama-7B, and CodeLlama-34B. We finetune GenNm on decompiled functions and teach models to leverage contextual information. GenNm includes names from callers and callees while querying a function, providing rich contextual information within the model's input token limitation. We mitigate model biases by aligning the output distribution of models with symbol preferences of developers. Our results show that GenNm improves the state-of-the-art name recovery precision by 5.6-11.4 percentage points on two commonly used datasets and improves the state-of-the-art by 32% (from 17.3% to 22.8%) in the most challenging setup where ground-truth variable names are not seen in the training dataset.
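The abstract describes GenNm querying a model for one function while including names from its callers and callees, within the model's input token limit. A minimal sketch of what such context assembly could look like is below; the function names, comment format, and characters-per-token heuristic are all illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of GenNm-style context assembly: when querying a model
# for variable names in one decompiled function, prepend names already known
# from its callers and callees, truncating context to fit a token budget.
# The prompt format and the ~4-chars-per-token heuristic are assumptions.

def build_query(target_fn: str, caller_names: dict, callee_names: dict,
                max_tokens: int = 4096) -> str:
    """Assemble a query surrounding the target function with call-graph context."""
    context_lines = []
    for fn, names in list(caller_names.items()) + list(callee_names.items()):
        context_lines.append(f"// {fn} uses: {', '.join(names)}")
    context = "\n".join(context_lines)
    prompt = context + "\n" + target_fn
    # Crude token budget: assume roughly 4 characters per token on average.
    max_chars = max_tokens * 4
    if len(prompt) > max_chars:
        # Truncate context only; the target function itself is kept intact.
        keep = max(max_chars - len(target_fn) - 1, 0)
        prompt = context[:keep] + "\n" + target_fn
    return prompt
```

The design choice sketched here, trimming context before the target function, reflects the abstract's emphasis that contextual names are auxiliary information while the queried function is what the model must label.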
Related papers
- Robust Learning of Diverse Code Edits [10.565439872488328]
Software engineering activities frequently involve edits to existing code.
Code language models (LMs) lack the ability to handle diverse types of code-edit requirements.
arXiv Detail & Related papers (2025-03-05T16:39:04Z) - ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages.
This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z) - HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that much LLM-generated code contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z) - REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space [35.61862064581971]
Large language models (LLMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data.
We propose REVS, a novel model editing method for unlearning sensitive information from LLMs.
arXiv Detail & Related papers (2024-06-13T17:02:32Z) - Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z) - GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding [28.02426812004216]
We introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models.
To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks and three pre-trained code models.
Compared to the state-of-the-art (SOTA) code augmentation method, MixCode, GenCode produces code models with 2.92% higher accuracy and 4.90% higher robustness on average.
arXiv Detail & Related papers (2024-02-24T08:57:12Z) - SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z) - Neuron Patching: Semantic-based Neuron-level Language Model Repair for Code Generation [32.178931149612644]
Model Improvement via Neuron Targeting (MINT) is a novel approach for repairing code Language Models (LMs).
MINT is effective, efficient, and reliable, capable of correcting a neural model by patching a minimum number of neurons.
arXiv Detail & Related papers (2023-12-08T20:28:08Z) - RefBERT: A Two-Stage Pre-trained Framework for Automatic Rename Refactoring [57.8069006460087]
We study automatic renaming of variable names, which is considered more challenging than other rename activities.
We propose RefBERT, a two-stage pre-trained framework for renaming variable names.
We show that the generated variable names of RefBERT are more accurate and meaningful than those produced by the existing method.
arXiv Detail & Related papers (2023-05-28T12:29:39Z) - Revisiting Deep Learning for Variable Type Recovery [3.075963833361584]
DIRTY is a Transformer-based encoder-decoder architecture capable of augmenting decompiled code with variable names and types.
We extend the original DIRTY results by re-training the DIRTY model on a dataset produced by the open-source Ghidra decompiler.
arXiv Detail & Related papers (2023-04-07T22:28:28Z) - Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data may instead stem from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z) - Pseudo Label-Guided Model Inversion Attack via Conditional Generative Adversarial Network [102.21368201494909]
Model inversion (MI) attacks have raised increasing concerns about privacy.
Recent MI attacks leverage a generative adversarial network (GAN) as an image prior to narrow the search space.
We propose a Pseudo Label-Guided MI (PLG-MI) attack via a conditional GAN (cGAN).
arXiv Detail & Related papers (2023-02-20T07:29:34Z) - Inflected Forms Are Redundant in Question Generation Models [27.49894653349779]
We propose an approach to enhance the performance of Question Generation using an encoder-decoder framework.
Firstly, we identify the inflected forms of words in the encoder input and replace them with their root words.
Secondly, we propose to adapt QG as a combination of the following actions in the encoder-decoder framework: generating a question word, copying a word from the source sequence, or generating a word transformation type.
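The first step above, replacing inflected forms with root words, can be sketched as follows; a small lookup table stands in for a real lemmatizer here, since the paper's actual morphological analysis is not specified in this summary.

```python
# Minimal sketch of the "replace inflected forms with root words" step.
# The ROOTS table is an illustrative assumption standing in for a real
# lemmatizer; unknown tokens pass through unchanged.

ROOTS = {"running": "run", "ran": "run", "cities": "city", "better": "good"}

def to_roots(tokens: list) -> list:
    """Map each token to its root form when known, else keep it unchanged."""
    return [ROOTS.get(t.lower(), t) for t in tokens]
```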
arXiv Detail & Related papers (2023-01-01T13:08:11Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
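ReCode's transformations include perturbing variable names while preserving program semantics. One way such a perturbation could be implemented for Python code is a consistent identifier rename over the AST; the benchmark covers many more transformation classes, so this single rename pass is an assumed illustration, not ReCode's implementation.

```python
# Illustrative sketch of one ReCode-style semantics-preserving perturbation:
# consistently renaming variables in a Python snippet via an AST rewrite.
import ast

class Renamer(ast.NodeTransformer):
    """Rewrite Name nodes according to a rename mapping."""
    def __init__(self, mapping: dict):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

def perturb(src: str, mapping: dict) -> str:
    """Apply a consistent variable rename; semantics are unchanged."""
    tree = Renamer(mapping).visit(ast.parse(src))
    return ast.unparse(tree)
```

Because the rewrite is applied uniformly to every occurrence, the perturbed program computes the same result, which is the property the human annotators in the abstract are verifying.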
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - How Important are Good Method Names in Neural Code Generation? A Model Robustness Perspective [14.453427809903424]
We study and demonstrate the potential of benefiting from method names to enhance the performance of PCGMs.
We propose a novel approach, named RADAR (neuRAl coDe generAtor Robustifier)
RADAR-Attack can reduce the CodeBLEU of generated code by 19.72% to 38.74% in three state-of-the-art PCGMs.
arXiv Detail & Related papers (2022-11-29T00:37:35Z) - DapStep: Deep Assignee Prediction for Stack Trace Error rePresentation [61.99379022383108]
We propose new deep learning models to solve the bug triage problem.
The models are based on a bidirectional recurrent neural network with attention and on a convolutional neural network.
To improve the quality of ranking, we propose using additional information from version control system annotations.
arXiv Detail & Related papers (2022-01-14T00:16:57Z) - Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling [17.377157455292817]
Decompilation is the procedure of transforming binary programs into a high-level representation, such as source code, for human analysts to examine.
We propose a novel solution to infer variable names in decompiled code based on Masked Language Modeling and Byte-Pair Encoding.
We show that our trained VarBERT model can predict variable names identical to the ones present in the original source code up to 84.15% of the time.
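A masked-language-modeling setup like VarBERT's needs the placeholder variable names in decompiled code replaced with mask tokens before prediction. A minimal sketch of that input-preparation step is below; the `v\d+` placeholder pattern (Hex-Rays-style names) and the `[MASK]` token are assumptions about the preprocessing, not the paper's exact pipeline.

```python
# Sketch of building a masked-LM query from decompiled code: each placeholder
# identifier (e.g. v1, v2) is replaced by a mask token, so a model can be
# asked to predict the original variable name at each masked position.
# The regex and mask token are assumptions.
import re

def mask_variables(decompiled: str, mask_token: str = "[MASK]") -> str:
    """Replace Hex-Rays-style placeholder names (v1, v2, ...) with a mask."""
    return re.sub(r"\bv\d+\b", mask_token, decompiled)
```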
arXiv Detail & Related papers (2021-03-23T19:09:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.