BinSym: Binary-Level Symbolic Execution using Formal Descriptions of Instruction Semantics
- URL: http://arxiv.org/abs/2404.04132v1
- Date: Fri, 5 Apr 2024 14:29:39 GMT
- Title: BinSym: Binary-Level Symbolic Execution using Formal Descriptions of Instruction Semantics
- Authors: Sören Tempel, Tobias Brandt, Christoph Lüth, Rolf Drechsler
- Abstract summary: BinSym is a framework for symbolic program analysis of software in binary form.
It operates directly on binary code instructions and does not require lifting them to an intermediate representation.
- Score: 2.4576576560952788
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: BinSym is a framework for symbolic program analysis of software in binary form. Contrary to prior work, it operates directly on binary code instructions and does not require lifting them to an intermediate representation (IR). This is achieved by formulating the symbolic semantics on top of a formal description of binary code instruction semantics. By building on existing formal descriptions, BinSym eliminates the manual effort required by prior work to implement transformations to an IR, thereby reducing the margin for errors. Furthermore, BinSym's symbolic semantics can be directly related to the binary code, which improves symbolic execution speed by reducing solver query complexity.
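As a rough illustration of the core idea (formulating symbolic semantics directly over an instruction description, so that one definition serves both concrete and symbolic execution), consider the minimal sketch below. It uses the Z3 Python bindings; all names are hypothetical, and this is not BinSym's actual implementation, which builds on an existing formal ISA description.

```python
# Minimal sketch (not BinSym's code): one instruction's semantics,
# interpreted over symbolic values with Z3. All names are hypothetical.
from z3 import BitVec, BitVecVal, Solver, sat

x1 = BitVec("x1", 32)                  # unconstrained symbolic register
regs = {"x1": x1, "x2": BitVecVal(7, 32)}

def exec_add(state, rd, rs1, rs2):
    # ADD semantics written once; Z3 operator overloading lets the same
    # definition run on concrete and symbolic operands alike.
    state[rd] = state[rs1] + state[rs2]

exec_add(regs, "x3", "x1", "x2")

s = Solver()
s.add(regs["x3"] == 0)                 # query: can x3 become zero?
if s.check() == sat:
    print("feasible, e.g. x1 =", s.model()[x1])   # x1 = -7 (mod 2^32)
```

Because the query is expressed over instruction-level semantics rather than a lifted IR, it stays close to the binary code, which is the effect the abstract attributes to BinSym's reduced solver query complexity.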
Related papers
- An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z) - StrTune: Data Dependence-based Code Slicing for Binary Similarity Detection with Fine-tuned Representation [5.41477941455399]
BCSD can address binary tasks such as malicious code snippet identification and binary patch analysis by comparing code patterns.
Because binaries are compiled under different compilation configurations, existing approaches still face notable limitations when comparing binary similarity.
We propose StrTune, which slices binary code based on data dependence and performs slice-level fine-tuning.
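To give the flavor of data-dependence-based slicing, here is a toy backward slice over def-use chains. The instruction format and the slicing criterion are invented for illustration; they are not StrTune's representation.

```python
# Toy backward slice over def-use chains; the instruction format is invented.
# Each instruction: (index, defined_var, used_vars)
insns = [
    (0, "a", []),          # a = input()
    (1, "b", ["a"]),       # b = a + 1
    (2, "c", []),          # c = 5
    (3, "d", ["b", "c"]),  # d = b * c
    (4, "e", ["a"]),       # e = a - 2  (not needed for d)
]

def backward_slice(insns, criterion):
    """Collect all instructions the criterion transitively depends on."""
    needed, in_slice = set(), set()
    for idx, defined, used in reversed(insns):
        if idx == criterion or defined in needed:
            in_slice.add(idx)
            needed.discard(defined)
            needed.update(used)
    return sorted(in_slice)

print(backward_slice(insns, criterion=3))  # -> [0, 1, 2, 3]
```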
arXiv Detail & Related papers (2024-11-19T12:20:08Z) - Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
We train and evaluate neural networks directly as binary classifiers of strings.
We provide results on a variety of languages across the Chomsky hierarchy for three neural architectures.
Our contributions will facilitate theoretically sound empirical testing of language recognition claims in future work.
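A minimal sketch of this experimental setup trains a recurrent network as a binary recognizer of one regular language (strings with an even number of 1s); the architecture and language are illustrative choices, not the paper's exact configuration.

```python
# Sketch: a GRU trained as a binary recognizer of the parity language.
import random
import torch
import torch.nn as nn

def sample(n=10):
    bits = [random.randint(0, 1) for _ in range(n)]
    return torch.tensor(bits, dtype=torch.float32), float(sum(bits) % 2 == 0)

class Recognizer(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.GRU(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, length)
        _, h = self.rnn(x.unsqueeze(-1))   # h: (1, batch, hidden)
        return self.out(h[-1]).squeeze(-1) # one accept/reject logit

model = Recognizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(500):
    xs, ys = zip(*(sample() for _ in range(32)))
    loss = loss_fn(model(torch.stack(xs)), torch.tensor(ys))
    opt.zero_grad(); loss.backward(); opt.step()
```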
arXiv Detail & Related papers (2024-11-11T16:33:25Z) - Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery [2.022692275087205]
Cross-architecture binary code analysis is an emerging problem.
Deep learning-based binary analysis has shown promising results.
For some low-resource ISAs, an adequate amount of data is hard to find.
arXiv Detail & Related papers (2024-04-29T18:09:28Z) - Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning [58.5857133154749]
We propose a new symbolic system with broad-coverage symbols and rational rules.
We leverage the recent advancement of LLMs as an approximation of the two ideal properties.
Our method shows superiority in extensive activity understanding tasks.
arXiv Detail & Related papers (2023-11-29T05:27:14Z) - CP-BCS: Binary Code Summarization Guided by Control Flow Graph and
Pseudo Code [79.87518649544405]
We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS.
CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
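For intuition, an instruction-level control flow graph maps each instruction to its possible successors (fall-through and branch targets); inverting that map gives predecessors for bidirectional traversal. Below is a toy construction with an invented instruction format, not CP-BCS's own representation.

```python
# Toy instruction-level CFG; the instruction encoding is invented.
# Each instruction: (address, mnemonic, branch_target or None)
insns = [
    (0, "cmp", None),
    (1, "beq", 4),     # conditional branch to address 4
    (2, "add", None),
    (3, "jmp", 5),     # unconditional jump
    (4, "sub", None),
    (5, "ret", None),
]

def build_cfg(insns):
    succ = {}
    for addr, op, target in insns:
        if op == "jmp":
            succ[addr] = [target]            # only the jump target
        elif op == "ret":
            succ[addr] = []                  # function exit
        else:
            succ[addr] = [addr + 1]          # fall-through
            if target is not None:
                succ[addr].append(target)    # plus the branch target
    return succ

print(build_cfg(insns))  # {0: [1], 1: [2, 4], 2: [3], 3: [5], 4: [5], 5: []}
```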
arXiv Detail & Related papers (2023-10-24T14:20:39Z) - Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts.
Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness.
Guess & Sketch extracts alignment and confidence information from features of the LM, then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z) - Code Representation Pre-training with Complements from Program Executions [29.148208436656216]
We propose FuzzPretrain to explore the dynamic information of programs revealed by their test cases and embed it into the feature representations of code as complements.
FuzzPretrain yielded more than 6%/9% mAP improvements on code search over its counterparts trained with only source code or AST.
arXiv Detail & Related papers (2023-09-04T01:57:22Z) - Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models [74.95486528482327]
We explore code prompting, a neural symbolic prompting method with zero-shot and few-shot versions that triggers code as intermediate steps.
We conduct experiments on 7 widely-used benchmarks involving symbolic reasoning and arithmetic reasoning.
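The mechanism can be sketched as follows: the model is prompted to emit code as its intermediate reasoning, and that code is executed to obtain the final answer. The prompt wording is invented, and call_llm is a hypothetical stand-in for a real model API.

```python
# Sketch of zero-shot code prompting; `call_llm` is a hypothetical stand-in.
QUESTION = "Tom has 3 boxes with 12 apples each. He gives away 7. How many remain?"

prompt = (
    "Write a Python function solve() that returns the answer.\n"
    f"Question: {QUESTION}\nCode:"
)

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; returns code an LLM might produce.
    return "def solve():\n    return 3 * 12 - 7"

namespace = {}
exec(call_llm(prompt), namespace)  # run the generated intermediate code ...
print(namespace["solve"]())        # ... and read off the answer: 29
```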
arXiv Detail & Related papers (2023-05-29T15:14:09Z) - ProgSG: Cross-Modality Representation Learning for Programs in Electronic Design Automation [38.023395256208055]
High-level synthesis (HLS) allows a developer to compile a high-level description, written as software code in C or C++, into a hardware design.
HLS tools still require microarchitecture decisions, expressed in terms of pragmas.
We propose ProgSG allowing the source code sequence modality and the graph modalities to interact with each other in a deep and fine-grained way.
arXiv Detail & Related papers (2023-05-18T09:44:18Z) - Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods.
By considering the relevance between code snippets in a large-scale code corpus, soft-labeled contrastive pre-training can obtain fine-grained soft labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
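The general idea of a soft-labeled contrastive objective can be sketched as cross-entropy between the model's similarity distribution and a graded target distribution, instead of a one-hot positive. This illustrates the concept only; it is not SCodeR's exact loss.

```python
# Sketch of a soft-labeled contrastive loss (illustrative, not SCodeR's).
import torch
import torch.nn.functional as F

def soft_contrastive_loss(anchor, candidates, soft_labels, tau=0.07):
    """anchor: (d,), candidates: (n, d), soft_labels: (n,) summing to 1."""
    sims = candidates @ anchor / tau                    # similarity logits
    return -(soft_labels * F.log_softmax(sims, dim=0)).sum()

anchor = F.normalize(torch.randn(128), dim=0)
cands = F.normalize(torch.randn(8, 128), dim=1)
labels = torch.tensor([0.6, 0.3, 0.1, 0, 0, 0, 0, 0])  # graded relevance
print(soft_contrastive_loss(anchor, cands, labels))
```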
arXiv Detail & Related papers (2022-10-18T05:17:37Z) - Pre-Training Representations of Binary Code Using Contrastive Learning [13.570375923483452]
We propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning.
COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning.
arXiv Detail & Related papers (2022-10-11T02:39:06Z) - BinBert: Binary Code Understanding with a Fine-tunable and Execution-aware Transformer [2.8523943706562638]
In this paper we present BinBert, a novel assembly code model.
BinBert is built on a transformer pre-trained on a huge dataset of both assembly instruction sequences and symbolic execution information.
Through fine-tuning, BinBert learns how to apply the general knowledge acquired with pre-training to the specific task.
arXiv Detail & Related papers (2022-08-13T17:48:52Z) - Linear Temporal Logic Modulo Theories over Finite Traces (Extended Version) [72.38188258853155]
This paper studies Linear Temporal Logic over Finite Traces (LTLf) in which proposition letters are replaced with first-order formulas interpreted over arbitrary theories.
The resulting logic, called LTLf Modulo Theories (LTLfMT), is semi-decidable.
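For example, where plain LTLf uses an atomic proposition, LTLfMT admits a first-order constraint over a background theory such as linear integer arithmetic. The formula below is an illustrative example, not one taken from the paper.

```latex
% Illustrative LTLfMT-style formula: "whenever x is positive,
% y eventually exceeds x"; the atoms are first-order constraints.
\[
  \mathsf{G}\bigl( x > 0 \rightarrow \mathsf{F}\,( y > x ) \bigr)
\]
```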
arXiv Detail & Related papers (2022-04-28T17:57:33Z) - A Natural Language Processing Approach for Instruction Set Architecture Identification [6.495883501989546]
We introduce character-level features of encoded binaries to identify fine-grained bit patterns inherent to each ISA.
Our approach results in an 8% higher accuracy than the state-of-the-art features based on byte-histograms and byte pattern signatures.
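Byte-level n-gram counts are one simple instance of such character-level features. The sketch below is illustrative, not the paper's exact pipeline, and the toy byte patterns are invented.

```python
# Sketch: byte-bigram histogram as a crude character-level ISA feature.
import numpy as np
from collections import Counter

def byte_bigram_features(blob: bytes) -> np.ndarray:
    counts = Counter(zip(blob, blob[1:]))
    vec = np.zeros(256 * 256)
    for (a, b), c in counts.items():
        vec[a * 256 + b] = c
    return vec / max(len(blob) - 1, 1)   # normalize by bigram count

# Two toy "binaries"; a real pipeline would feed such vectors to a classifier.
f1 = byte_bigram_features(b"\x93\x00\x10\x00" * 64)  # invented pattern A
f2 = byte_bigram_features(b"\x48\x89\xe5\xc3" * 64)  # invented pattern B
print(float(f1 @ f2))  # low dot product suggests dissimilar byte statistics
```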
arXiv Detail & Related papers (2022-04-13T19:45:06Z) - Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection, require recovering the contextual meaning of binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic, which utilizes BERT to produce a semantic-aware code representation of binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z) - How could Neural Networks understand Programs? [67.4217527949013]
It is difficult to build a model that better understands programs by either directly applying off-the-shelf NLP pre-training techniques to the source code or adding features to the model by heuristics.
We propose a novel program semantics learning paradigm in which the model learns from information composed of (1) representations that align well with the fundamental operations in operational semantics, and (2) information about environment transitions.
arXiv Detail & Related papers (2021-05-10T12:21:42Z) - High-performance symbolic-numerics via multiple dispatch [52.77024349608834]
Symbolics.jl is an extendable symbolic system that uses dynamic multiple dispatch to change behavior depending on domain needs.
We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system.
We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers.
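The flavor of dispatch-based extensibility can be approximated in Python with functools.singledispatch; this is a loose analogy (Julia dispatches on all argument types, singledispatch only on the first) and not Symbolics.jl code.

```python
# Loose Python analogy for dispatch-driven term rewriting.
from functools import singledispatch

@singledispatch
def simplify(term):
    return term                      # default: leave unknown terms alone

class Add:                           # toy symbolic node representing a + b
    def __init__(self, a, b):
        self.a, self.b = a, b

@simplify.register
def _(term: Add):
    # Rewrite rule x + 0 -> x, selected by dispatch on the node type.
    if term.b == 0:
        return simplify(term.a)
    return term

print(simplify(Add("x", 0)))   # -> 'x'
print(simplify(42))            # -> 42 (falls back to the default method)
```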
arXiv Detail & Related papers (2021-05-09T14:22:43Z) - Representing Partial Programs with Blended Abstract Semantics [62.20775388513027]
We introduce a technique for representing partially written programs in a program synthesis engine.
We learn an approximate execution model implemented as a modular neural network.
We show that these hybrid neuro-symbolic representations enable execution-guided synthesizers to use more powerful language constructs.
arXiv Detail & Related papers (2020-12-23T20:40:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.