PEM: Representing Binary Program Semantics for Similarity Analysis via a
Probabilistic Execution Model
- URL: http://arxiv.org/abs/2308.15449v2
- Date: Wed, 30 Aug 2023 01:57:23 GMT
- Authors: Xiangzhe Xu, Zhou Xuan, Shiwei Feng, Siyuan Cheng, Yapeng Ye, Qingkai
Shi, Guanhong Tao, Le Yu, Zhuo Zhang, and Xiangyu Zhang
- Abstract summary: We propose a new method to represent binary program semantics.
It is based on a novel probabilistic execution engine that can effectively sample the input space and the program path space of subject binaries.
Our evaluation on 9 real-world projects with 35k functions, and comparison with 6 state-of-the-art techniques show that PEM can achieve a precision of 96% with common settings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Binary similarity analysis determines if two binary executables are from the
same source program. Existing techniques leverage static and dynamic program
features and may utilize advanced Deep Learning techniques. Although they have
demonstrated great potential, the community believes that a more effective
representation of program semantics can further improve similarity analysis. In
this paper, we propose a new method to represent binary program semantics. It
is based on a novel probabilistic execution engine that can effectively sample
the input space and the program path space of subject binaries. More
importantly, it ensures that the collected samples are comparable across
binaries, addressing the substantial variations of input specifications. Our
evaluation on 9 real-world projects with 35k functions, and comparison with 6
state-of-the-art techniques show that PEM can achieve a precision of 96% with
common settings, outperforming the baselines by 10-20%.
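The core idea in the abstract, making samples comparable across binaries by driving them with the same sampled inputs, can be illustrated with a toy sketch. This is not the PEM engine: PEM operates on binary code and also samples the program path space, whereas the sketch below only samples inputs for ordinary Python functions. All names here (`sample_inputs`, `observe`, `multiset_jaccard`, `similarity`, and the three example functions) are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def sample_inputs(seed, n_samples, n_args, lo=-64, hi=64):
    """Draw a fixed, reproducible list of integer argument tuples."""
    rng = random.Random(seed)  # same seed -> same inputs for every function
    return [tuple(rng.randint(lo, hi) for _ in range(n_args))
            for _ in range(n_samples)]


def observe(func, inputs):
    """Run func on every input and collect the observed results as a multiset."""
    observed = Counter()
    for args in inputs:
        try:
            observed[func(*args)] += 1
        except Exception as exc:
            # Record the kind of failure instead of aborting the whole run.
            observed[type(exc).__name__] += 1
    return observed


def multiset_jaccard(a, b):
    """Jaccard similarity of two multisets (Counter & is min, | is max)."""
    intersection = sum((a & b).values())
    union = sum((a | b).values())
    return intersection / union if union else 1.0


def similarity(f, g, seed=0, n_samples=200, n_args=2):
    """Compare two functions on the SAME sampled inputs."""
    inputs = sample_inputs(seed, n_samples, n_args)
    return multiset_jaccard(observe(f, inputs), observe(g, inputs))


# Hypothetical stand-ins for two compilations of the same source function.
def clamp_sum_v1(a, b):
    s = a + b
    if s > 50:
        return 50
    if s < -50:
        return -50
    return s


def clamp_sum_v2(a, b):
    return max(-50, min(50, a + b))


# A semantically different function for contrast.
def product(a, b):
    return a * b


if __name__ == "__main__":
    print(similarity(clamp_sum_v1, clamp_sum_v2))
    print(similarity(clamp_sum_v1, product))
```

Because both functions see identical inputs, two implementations with the same behavior produce identical multisets of observations, while a different function diverges. Sharing the seed is the simplest stand-in for the "comparable samples" requirement; how PEM itself achieves this is described in the paper, not here.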
Related papers
- TRIAD: Automated Traceability Recovery based on Biterm-enhanced Deduction of Transitive Links among Artifacts [53.92293118080274]
Traceability allows stakeholders to extract and comprehend the trace links among software artifacts introduced across the software life cycle.
Most recovery approaches rely on textual similarities among software artifacts, such as those based on Information Retrieval (IR).
arXiv Detail & Related papers (2023-12-28T06:44:24Z)
- Improved Tree Search for Automatic Program Synthesis [91.3755431537592]
A key element is being able to perform an efficient search in the space of valid programs.
Here, we suggest a variant of MCTS that leads to state-of-the-art results on two vastly different DSLs.
arXiv Detail & Related papers (2023-03-13T15:09:52Z)
- AdaBin: Improving Binary Neural Networks with Adaptive Binary Sets [27.022212653067367]
This paper studies Binary Neural Networks (BNNs), in which weights and activations are both binarized into 1-bit values.
We present a simple yet effective approach called AdaBin to adaptively obtain the optimal binary sets.
Experimental results on benchmark models and datasets demonstrate that the proposed AdaBin is able to achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-08-17T05:43:33Z)
- MAGPIE: Machine Automated General Performance Improvement via Evolution of Software [19.188864062289433]
MAGPIE is a unified software improvement framework.
It provides a common edit sequence based representation that isolates the search process from the specific improvement technique.
arXiv Detail & Related papers (2022-08-04T17:58:43Z)
- Learning from Self-Sampled Correct and Partially-Correct Programs [96.66452896657991]
We propose to let the model perform sampling during training and learn from both self-sampled fully-correct programs and partially-correct programs.
We show that our use of self-sampled correct and partially-correct programs can benefit learning and help guide the sampling process.
Our proposed method improves the pass@k performance by 3.1% to 12.3% compared to learning from a single reference program with MLE.
arXiv Detail & Related papers (2022-05-28T03:31:07Z)
- Natural Language to Code Translation with Execution [82.52142893010563]
We introduce execution result-based minimum Bayes risk decoding for program selection.
We show that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks.
arXiv Detail & Related papers (2022-04-25T06:06:08Z)
- Enforcing Consistency in Weakly Supervised Semantic Parsing [68.2211621631765]
We explore the use of consistency between the output programs for related inputs to reduce the impact of spurious programs.
We find that a more consistent formalism leads to improved model performance even without consistency-based training.
arXiv Detail & Related papers (2021-07-13T03:48:04Z)
- Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovering the contextual meaning of binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic, which uses BERT to produce a semantic-aware representation of binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
- Representing Partial Programs with Blended Abstract Semantics [62.20775388513027]
We introduce a technique for representing partially written programs in a program synthesis engine.
We learn an approximate execution model implemented as a modular neural network.
We show that these hybrid neuro-symbolic representations enable execution-guided synthesizers to use more powerful language constructs.
arXiv Detail & Related papers (2020-12-23T20:40:18Z)
- Improving the Effectiveness of Traceability Link Recovery using Hierarchical Bayesian Networks [21.15456830607455]
We implement a HierarchiCal PrObabilistic Model for SoftwarE Traceability (Comet).
Comet is capable of modeling relationships between artifacts by combining the complementary observational prowess of multiple measures of textual similarity.
We conduct a comprehensive empirical evaluation of Comet that illustrates an improvement over a set of optimally configured baselines.
arXiv Detail & Related papers (2020-05-18T19:38:29Z)
- Bin2vec: Learning Representations of Binary Executable Programs for Security Tasks [15.780176500971244]
We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs.
We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks.
We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach.
arXiv Detail & Related papers (2020-02-09T15:46:43Z)