Reverse Engineering Structure and Semantics of Input of a Binary Executable
- URL: http://arxiv.org/abs/2405.14052v1
- Date: Wed, 22 May 2024 22:47:33 GMT
- Title: Reverse Engineering Structure and Semantics of Input of a Binary Executable
- Authors: Seshagiri Prabhu Narasimha, Arun Lakhotia
- Abstract summary: This paper presents an algorithm to recover the structure and semantic relations between fields of the input of binary executables.
The algorithm was implemented in a prototype system named ByteRI 2.0.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge of the input format of binary executables is important for finding bugs and vulnerabilities, such as generating data for fuzzing or manual reverse engineering. This paper presents an algorithm to recover the structure and semantic relations between fields of the input of binary executables using dynamic taint analysis. The algorithm improves upon prior work by not just partitioning the input into consecutive bytes representing values but also identifying syntactic components of structures, such as atomic fields of fixed and variable lengths, and different types of arrays, such as arrays of atomic fields, arrays of records, and arrays with variant records. It also infers the semantic relations between fields of a structure, such as count fields that specify the count of an array of records or offset fields that specify the start location of a variable-length field within the input data. The algorithm constructs a C/C++-like structure to represent the syntactic components and semantic relations. The algorithm was implemented in a prototype system named ByteRI 2.0. The system was evaluated using a controlled experiment with synthetic subject programs and real-world programs. The subject programs were created to accept a variety of input formats that mimic syntactic components and selected semantic relations found in conventional data formats, such as PE, PNG, ZIP, and CSV. The results show that ByteRI 2.0 correctly identifies the syntactic elements and their grammatical structure, as well as the semantic relations between the fields for both synthetic subject programs and real-world programs. The recovered structures, when used as a generator, produced valid data that was acceptable for all the synthetic subject programs and some of the real-world programs.
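For illustration, the kind of C/C++-like structure the algorithm aims to recover might look like the following sketch for a hypothetical chunk-based format; the field names and layout are assumptions for exposition, loosely mimicking PNG/ZIP-style layouts, and are not output taken from the paper.
```c
#include <stdint.h>

/* Hypothetical chunk-based input format (illustrative only). */

/* A record whose trailing bytes are sized by the count field before them. */
struct chunk {
    uint32_t type;        /* fixed-length atomic field                      */
    uint32_t length;      /* count field: number of bytes in data[]         */
    uint8_t  data[];      /* variable-length atomic field, `length` bytes   */
};

/* Fixed-length header whose fields describe the rest of the input. */
struct header {
    uint8_t  magic[4];    /* fixed-length atomic field (file signature)     */
    uint32_t chunk_count; /* count field: number of chunks that follow      */
    uint32_t data_offset; /* offset field: byte position of a trailing
                             variable-length section within the input       */
};
```
A recovered description of this kind can then be used as a generator to emit syntactically valid inputs, which is how the paper evaluates the recovered structures.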
Related papers
- Inferring Attributed Grammars from Parser Implementations [1.0217990949413291]
We introduce a novel approach for inferring attributed grammars from implementations of input grammars. By observing runtime executions and mapping the program's behavior to the grammar, we systematically extract and embed semantic actions into the grammar rules. We demonstrate the feasibility of our approach using an initial set of programs, showing that it can accurately reproduce program behavior through the generated attributed grammars.
arXiv Detail & Related papers (2025-07-17T13:32:59Z) - Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code [55.493408628371235]
We propose ByteTR, a framework for recovering variable types in binary code.
In light of the ubiquity of variable propagation across functions, ByteTR conducts inter-procedural analysis to trace variable propagation and employs a gated graph neural network to capture long-range data flow dependencies for variable type recovery.
arXiv Detail & Related papers (2025-03-10T12:27:05Z) - StrTune: Data Dependence-based Code Slicing for Binary Similarity Detection with Fine-tuned Representation [5.41477941455399]
Binary code similarity detection (BCSD) can address binary tasks such as malicious code snippet identification and binary patch analysis by comparing code patterns.
Because binaries are compiled under different compilation configurations, existing approaches still face notable limitations when comparing binaries for similarity.
We propose StrTune, which slices binary code based on data dependence and performs slice-level fine-tuning.
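As a rough source-level illustration of what a data-dependence slice captures (an assumption for exposition only; StrTune operates on binary code, and the function below is hypothetical), consider:
```c
#include <stdio.h>

/* A data-dependence slice w.r.t. `sum` at the return keeps the statements
 * whose values flow (transitively) into sum, and drops the rest. */
int demo(int n) {
    int sum = 0;                  /* in slice: defines sum                    */
    int count = 0;                /* dropped: sum never reads count           */
    for (int i = 0; i < n; i++) { /* in slice: i is read when updating sum    */
        sum += i;                 /* in slice: updates sum using i            */
        count++;                  /* dropped                                  */
    }
    printf("unrelated: %d\n", count); /* dropped: no data flow into sum       */
    return sum;                   /* slicing criterion                        */
}
```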
arXiv Detail & Related papers (2024-11-19T12:20:08Z) - Transformers are Efficient Compilers, Provably [11.459397066286822]
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks.
In this paper, we take the first steps towards a formal investigation of using transformers as compilers from an expressive power perspective.
We introduce a representative programming language, Mini-Husky, which encapsulates key features of modern C-like languages.
arXiv Detail & Related papers (2024-10-07T20:31:13Z) - Constructing a BPE Tokenization DFA [0.0]
We give and analyze an efficient construction of a deterministic finite automaton (DFA) for BPE tokenization. We show how to construct an input-deterministic (subsequential) string-to-string transducer which precisely describes the relationship between strings and their correct tokenizations.
arXiv Detail & Related papers (2024-05-13T11:59:24Z) - Compositional Program Generation for Few-Shot Systematic Generalization [59.57656559816271]
This study presents a neuro-symbolic architecture called the Compositional Program Generator (CPG).
CPG has three key features: modularity, composition, and abstraction, in the form of grammar rules.
It achieves perfect generalization on both the SCAN and COGS benchmarks using just 14 examples for SCAN and 22 examples for COGS.
arXiv Detail & Related papers (2023-09-28T14:33:20Z) - Improved Tree Search for Automatic Program Synthesis [91.3755431537592]
A key element is being able to perform an efficient search in the space of valid programs.
Here, we suggest a variant of MCTS that leads to state-of-the-art results on two vastly different DSLs.
arXiv Detail & Related papers (2023-03-13T15:09:52Z) - A Divide-Align-Conquer Strategy for Program Synthesis [8.595181704811889]
We show that compositional segmentation can be applied in the programming by examples setting to divide the search for large programs across multiple smaller program synthesis problems.
A structural alignment of the constituent parts in the input and output leads to pairwise correspondences used to guide the program search.
arXiv Detail & Related papers (2023-01-08T19:10:55Z) - Source Code Summarization with Structural Relative Position Guided Transformer [19.828300746504148]
Source code summarization aims at generating concise and clear natural language descriptions for programming languages.
Recent efforts focus on incorporating the syntax structure of code into neural networks such as Transformer.
We propose a Structural Relative Position guided Transformer, named SCRIPT.
arXiv Detail & Related papers (2022-02-14T07:34:33Z) - Latent Execution for Neural Program Synthesis Beyond Domain-Specific Languages [97.58968222942173]
We take the first step to synthesize C programs from input-output examples.
In particular, we propose LaSynth, which learns the latent representation to approximate the execution of partially generated programs.
We show that training on these synthesized programs further improves the prediction performance for both Karel and C program synthesis.
arXiv Detail & Related papers (2021-06-29T02:21:32Z) - Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z) - Syntactic representation learning for neural network based TTS with syntactic parse tree traversal [49.05471750563229]
We propose a syntactic representation learning method based on syntactic parse tree to automatically utilize the syntactic structure information.
Experimental results demonstrate the effectiveness of our proposed approach.
For sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived from the synthesized speeches.
arXiv Detail & Related papers (2020-12-13T05:52:07Z) - LogicalFactChecker: Leveraging Logical Operations for Fact Checking with Graph Module Network [111.24773949467567]
We propose LogicalFactChecker, a neural network approach capable of leveraging logical operations for fact checking.
It achieves state-of-the-art performance on TABFACT, a large-scale benchmark dataset.
arXiv Detail & Related papers (2020-04-28T17:04:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.