Levels of Binary Equivalence for the Comparison of Binaries from Alternative Builds
- URL: http://arxiv.org/abs/2410.08427v1
- Date: Fri, 11 Oct 2024 00:16:26 GMT
- Title: Levels of Binary Equivalence for the Comparison of Binaries from Alternative Builds
- Authors: Jens Dietrich, Tim White, Behnaz Hassanshahi, Paddy Krishnan
- Abstract summary: Build platform variability can strengthen security as it facilitates the detection of compromised build environments.
The availability of multiple binaries built from the same sources creates new challenges and opportunities.
Answering such questions requires a notion of equivalence between binaries.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In response to challenges in software supply chain security, several organisations have created infrastructures to independently build commodity open source projects and release the resulting binaries. Build platform variability can strengthen security as it facilitates the detection of compromised build environments. Furthermore, by improving the security posture of the build platform and collecting provenance information during the build, the resulting artifacts can be used with greater trust. Such offerings are now available from Google, Oracle and RedHat. The availability of multiple binaries built from the same sources creates new challenges and opportunities, and raises questions such as: 'Does build A confirm the integrity of build B?' or 'Can build A reveal a compromised build B?'. To answer such questions requires a notion of equivalence between binaries. We demonstrate that the obvious approach based on bitwise equality has significant shortcomings in practice, and that there is value in opting for alternative notions. We conceptualise this by introducing levels of equivalence, inspired by clone detection types. We demonstrate the value of these new levels through several experiments. We construct a dataset consisting of Java binaries built from the same sources independently by different providers, resulting in 14,156 pairs of binaries in total. We then compare the compiled class files in those jar files and find that for 3,750 pairs of jars (26.49%) there is at least one such file that is different, also forcing the jar files and their cryptographic hashes to be different. However, based on the new equivalence levels, we can still establish that many of them are practically equivalent. We evaluate several candidate equivalence relations on a semi-synthetic dataset that provides oracles consisting of pairs of binaries that either should be, or must not be equivalent.
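The strictest equivalence level discussed in the abstract, bitwise equality, can be checked by comparing the bytes of each compiled class file inside two jars. The following is a minimal illustrative sketch of that idea, not the authors' actual tooling: it hashes every `.class` entry in each jar and reports the entries whose bytes differ, which is what forces the jar files and their cryptographic hashes apart even when only one class changed.

```python
import hashlib
import zipfile

def entry_hashes(jar_path):
    """Map each .class entry in a jar to the SHA-256 of its bytes."""
    with zipfile.ZipFile(jar_path) as jar:
        return {
            name: hashlib.sha256(jar.read(name)).hexdigest()
            for name in jar.namelist()
            if name.endswith(".class")
        }

def differing_classes(jar_a, jar_b):
    """Return the class entries whose bytes differ between two jars,
    i.e. the pairs that fail bitwise equality (the strictest level).
    Entries present in only one jar are also reported."""
    a, b = entry_hashes(jar_a), entry_hashes(jar_b)
    return sorted(
        name for name in a.keys() | b.keys()
        if a.get(name) != b.get(name)
    )
```

An empty result means the two builds are bitwise-identical at the class-file level; a non-empty result is where the weaker, clone-detection-inspired equivalence levels proposed in the paper would be needed to decide whether the builds are still practically equivalent.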
Related papers
- DALEQ -- Explainable Equivalence for Java Bytecode [1.4003844469021811]
We present daleq, a tool that disassembles Java bytecode into a relational database. It can then normalise this database by applying datalog rules and infer equivalence between two classes. We demonstrate the impact of daleq in an industrial context through a large-scale evaluation involving 2,714 pairs of jars.
arXiv Detail & Related papers (2025-08-03T01:17:25Z) - Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation [12.983487033256448]
Decompile-Bench is the first open-source dataset comprising two million binary-source function pairs condensed from 100 million collected function pairs. For evaluation purposes, we developed a benchmark, Decompile-Bench-Eval, including manually crafted binaries from the well-established HumanEval and MBPP. We find that fine-tuning with Decompile-Bench yields a 20% improvement over previous benchmarks in terms of the re-executability rate.
arXiv Detail & Related papers (2025-05-19T03:34:33Z) - BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models [50.17907898478795]
We introduce BinMetric, a benchmark designed to evaluate the performance of large language models on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks. Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field.
arXiv Detail & Related papers (2025-05-12T08:54:07Z) - Attestable builds: compiling verifiable binaries on untrusted systems using trusted execution environments [3.207381224848367]
Attestable builds provide strong source-to-binary correspondence in software artifacts. We tackle the challenge of opaque build pipelines that disconnect the trust between source code and the final binary artifact.
arXiv Detail & Related papers (2025-05-05T10:00:04Z) - ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages.
This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z) - Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation [37.25839260805938]
Skeleton-Guided-Translation is a framework for repository-level Java to C# code translation with fine-grained quality evaluation.
We present TRANSREPO-BENCH, a benchmark of high quality open-source Java repositories and their corresponding C# skeletons.
arXiv Detail & Related papers (2025-01-27T13:44:51Z) - Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z) - BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis [6.093226756571566]
We construct a benchmark dataset for fine-grained binary code similarity analysis called BinSimDB.
Specifically, we propose BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets.
The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
arXiv Detail & Related papers (2024-10-14T05:13:48Z) - Assemblage: Automatic Binary Dataset Construction for Machine Learning [35.674339346299654]
Assemblage is a cloud-based distributed system that crawls, configures, and builds Windows PE binaries.
We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations.
arXiv Detail & Related papers (2024-05-07T04:10:01Z) - Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures [0.0]
This research introduces a novel ensemble learning approach for code similarity assessment.
The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses.
arXiv Detail & Related papers (2024-05-03T13:42:49Z) - BinGo: Identifying Security Patches in Binary Code with Graph Representation Learning [19.22004583230725]
We propose BinGo, a new security patch detection system for binary code.
BinGo consists of four phases, namely, patch data pre-processing, graph extraction, embedding generation, and graph representation learning.
Our experimental results show BinGo can achieve up to 80.77% accuracy in identifying security patches between two neighboring versions of binary code.
arXiv Detail & Related papers (2023-12-13T06:35:39Z) - CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion [86.01508183157613]
CrossCodeEval is built on a diverse set of real-world, open-sourced, permissively-licensed repositories in four popular programming languages.
We show that CrossCodeEval is extremely challenging when the relevant cross-file context is absent.
We also show that CrossCodeEval can also be used to measure the capability of code retrievers.
arXiv Detail & Related papers (2023-10-17T13:18:01Z) - On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository.
We retrieve over 53k potential vulnerable clones from Maven Central.
We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z) - Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with binarization function.
We present new designs of binary neural modules, which outperform existing binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z) - Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity [2.095199622772379]
Repo2Vec is a comprehensive embedding approach to represent a repository as a distributed vector.
We evaluate our method with two real datasets from GitHub for a combined 1013 repositories.
arXiv Detail & Related papers (2021-07-11T18:57:03Z) - Contextualizing Meta-Learning via Learning to Decompose [125.76658595408607]
We propose Learning to Decompose Network (LeadNet) to contextualize the meta-learned "support-to-target" strategy.
LeadNet learns to automatically select the strategy associated with the right context by incorporating the change of comparison across contexts with polysemous embeddings.
arXiv Detail & Related papers (2021-06-15T13:10:56Z) - D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.