What Makes the Best Decomposition? Investigating Binary Decomposition Under FCG Variance
- URL: http://arxiv.org/abs/2506.19425v1
- Date: Tue, 24 Jun 2025 08:43:28 GMT
- Title: What Makes the Best Decomposition? Investigating Binary Decomposition Under FCG Variance
- Authors: Ang Jia, He Jiang, Zhilei Ren, Xiaochen Li, Ming Fan, Ting Liu
- Abstract summary: We conduct the first systematic empirical study on the variance of function call graphs compiled by various compilation settings. We find that the size of FCGs changes dramatically, while the FCGs are still linked by three different kinds of mappings. We propose a method to identify the optimal decomposition and compare the existing decomposition works with the optimal decomposition.
- Score: 11.950334662727848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Binary decomposition, which decomposes binary files into modules, plays a critical role in binary reuse detection. Existing binary decomposition works either apply anchor-based methods, which extend anchor functions to generate modules, or clustering-based methods, which use clustering algorithms to group binary functions; both rely on the assumption that reused code shares similar function call relationships. However, we find that function call graphs (FCGs) vary substantially under different compilation settings, especially with diverse function inlining decisions. In this work, we conduct the first systematic empirical study on the variance of FCGs compiled under various compilation settings and explore its effect on binary decomposition methods. We first construct a dataset compiled by 17 compilers, with 6 optimization settings, for 4 architectures, and analyze the changes and mappings of the FCGs. We find that the size of FCGs changes dramatically, while the FCGs are still linked by three different kinds of mappings. Then we evaluate the existing works under FCG variance; the results show that existing works face great challenges when conducting cross-compiler evaluation with diverse optimization settings. Finally, we propose a method to identify the optimal decomposition and compare the existing decomposition works against it. Existing works either suffer from low coverage or cannot generate stable community similarities.
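To make the FCG variance concrete, here is a minimal sketch (Python with networkx) that compares two call graphs of the same program built under different settings and classifies how each function maps across them. The one-to-one/inlined/missing taxonomy, the `inlined` node attribute, and all names below are illustrative assumptions, not the paper's actual three mapping kinds or tooling.

```python
# Illustrative sketch (not the paper's implementation): compare two FCGs of
# the same program under different compilation settings and classify how each
# function of graph A appears in graph B. The taxonomy here is hypothetical.
import networkx as nx

def classify_mappings(fcg_a: nx.DiGraph, fcg_b: nx.DiGraph):
    """Classify each function in fcg_a by how it shows up in fcg_b."""
    mappings = {}
    for fn in fcg_a.nodes:
        if fn in fcg_b:
            mappings[fn] = "one-to-one"   # survives as its own node
        elif any(fn in fcg_b.nodes[c].get("inlined", ()) for c in fcg_b.nodes):
            mappings[fn] = "inlined"      # absorbed into a caller
        else:
            mappings[fn] = "missing"      # eliminated or unresolved
    return mappings

# Toy example: -O0 keeps `helper` as a node; -O3 inlines it into `main`.
g_o0 = nx.DiGraph([("main", "helper"), ("helper", "memcpy")])
g_o3 = nx.DiGraph([("main", "memcpy")])
g_o3.nodes["main"]["inlined"] = {"helper"}

print(classify_mappings(g_o0, g_o3))
# {'main': 'one-to-one', 'helper': 'inlined', 'memcpy': 'one-to-one'}
```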
Related papers
- BinCoFer: Three-Stage Purification for Effective C/C++ Binary Third-Party Library Detection [3.406168883492101]
Third-party libraries (TPLs) are becoming increasingly popular as a way to achieve efficient and concise software development. However, unregulated use of TPLs introduces legal and security issues into software development. BinCoFer is a tool designed for detecting TPLs reused in binary programs.
arXiv Detail & Related papers (2025-04-28T07:57:42Z)
- Beyond the Edge of Function: Unraveling the Patterns of Type Recovery in Binary Code [55.493408628371235]
We propose ByteTR, a framework for recovering variable types in binary code. In light of the ubiquity of variable propagation across functions, ByteTR conducts inter-procedural analysis to trace variable propagation and employs a gated graph neural network to capture long-range data flow dependencies for variable type recovery.
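As an illustration of the gated graph message passing mentioned above, here is a minimal GGNN-style propagation step in NumPy. The GRU-style update and the layer shapes are generic textbook choices, not ByteTR's actual architecture.

```python
# Minimal GRU-gated graph propagation step. Generic GGNN-style update, not
# ByteTR's architecture; all weights below are random placeholders.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_step(h, adj, W_msg, W_z, W_r, W_h):
    """h: (n, d) node states, adj: (n, n) adjacency; returns updated states."""
    m = adj @ (h @ W_msg)                        # aggregate neighbor messages
    mh = np.concatenate([m, h], axis=1)          # (n, 2d)
    z = sigmoid(mh @ W_z)                        # update gate
    r = sigmoid(mh @ W_r)                        # reset gate
    h_tilde = np.tanh(np.concatenate([m, r * h], axis=1) @ W_h)
    return (1 - z) * h + z * h_tilde             # gated state update

# Toy usage: 4 nodes on a path graph, 8-dim states, random weights.
rng = np.random.default_rng(0)
n, d = 4, 8
h = rng.normal(size=(n, d))
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
Ws = [rng.normal(scale=0.1, size=s) for s in [(d, d), (2*d, d), (2*d, d), (2*d, d)]]
h = gated_step(h, adj, *Ws)
```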
arXiv Detail & Related papers (2025-03-10T12:27:05Z)
- FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [51.898805184427545]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries. We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language. We then build a binary code similarity model (FoC-Sim) upon FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions from a database.
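The retrieval step can be pictured as nearest-neighbor search over function embeddings. The sketch below shows a generic cosine-similarity lookup; the embeddings, names, and database are placeholders, and no FoC-BinLLM or FoC-Sim internals are reproduced.

```python
# Generic cosine-similarity retrieval over function embeddings, as a stand-in
# for a FoC-Sim-style database lookup; vectors and names are placeholders.
import numpy as np

def retrieve_similar(query_vec, db_vecs, db_names, k=3):
    """Return the k database entries most similar to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = d @ q                                 # cosine similarities
    top = np.argsort(sims)[::-1][:k]
    return [(db_names[i], float(sims[i])) for i in top]

# Toy usage with random 64-dim embeddings for three known primitives.
rng = np.random.default_rng(1)
names = ["aes_encrypt", "sha256_update", "rc4_ksa"]
db = rng.normal(size=(3, 64))
print(retrieve_similar(db[1] + 0.1 * rng.normal(size=64), db, names, k=2))
```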
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
- Simple Multigraph Convolution Networks [49.19906483875984]
Existing multigraph convolution methods either ignore the cross-view interaction among multiple graphs, or induce extremely high computational cost due to standard cross-view operators.
This paper proposes Simple Multigraph Convolution Networks (SMGCN), which first extract consistent cross-view topology from multigraphs, including edge-level and subgraph-level topology, and then perform expansion based on the raw multigraphs and the consistent topologies.
In theory, SMGCN utilizes the consistent topologies in expansion rather than standard cross-view expansion, which performs credible cross-view spatial message-passing, and effectively reduces the complexity of standard expansion.
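One plausible reading of edge-level consistent topology is the set of edges shared by all views, used as a consensus graph for propagation. The sketch below implements that reading; it is an illustrative interpretation of the abstract, not SMGCN's exact extraction or expansion operators.

```python
# Edge-level "consistent topology" read as the intersection of every view's
# edges, followed by one row-normalized propagation step. Illustrative only.
import numpy as np

def consistent_edges(adjs):
    """adjs: list of (n, n) binary adjacency matrices, one per view."""
    consensus = np.ones_like(adjs[0], dtype=float)
    for a in adjs:
        consensus *= (a > 0)          # an edge survives only if in every view
    return consensus

def propagate(adjs, x, w):
    """x: (n, f) node features, w: (f, f2) weights."""
    a = consistent_edges(adjs)
    deg = a.sum(axis=1, keepdims=True).clip(min=1.0)
    return np.tanh((a / deg) @ x @ w)  # normalized message passing
```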
arXiv Detail & Related papers (2024-03-08T03:27:58Z)
- Cross-Inlining Binary Function Similarity Detection [16.923959153965857]
We propose a pattern-based model named CI-Detector for cross-inlining matching.
Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.
arXiv Detail & Related papers (2024-01-11T08:42:08Z)
- Factorizers for Distributed Sparse Block Codes [45.29870215671697]
We propose a fast and highly accurate method for factorizing distributed sparse block codes (SBCs).
Our iterative factorizer introduces a threshold-based nonlinear activation, conditional random sampling, and an $\ell_\infty$-based similarity metric.
We demonstrate the feasibility of our method on four deep CNN architectures over CIFAR-100, ImageNet-1K, and RAVEN datasets.
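To ground the terms above, here is a toy resonator-style iterative factorizer with a threshold activation. It sketches the general technique only: it uses dense bipolar codes for simplicity, whereas the paper targets sparse block codes, and its conditional random sampling and exact $\ell_\infty$ metric are not reproduced.

```python
# Toy resonator-style iterative factorizer with a threshold nonlinearity.
# Dense bipolar codes stand in for the paper's sparse block codes; the
# conditional sampling and l_inf similarity of the paper are not reproduced.
import numpy as np

rng = np.random.default_rng(0)
D, M = 256, 10                        # vector dimension, codebook size
A = rng.choice([-1, 1], size=(M, D))  # codebook for factor 1
B = rng.choice([-1, 1], size=(M, D))  # codebook for factor 2
s = A[3] * B[7]                       # query: elementwise (Hadamard) binding

a_hat, b_hat = A.mean(axis=0), B.mean(axis=0)   # superposition initial guesses
for _ in range(10):
    a_scores = A @ (s * np.sign(b_hat))   # unbind current b, match codebook
    a_scores[a_scores < 0] = 0            # threshold-based activation
    a_hat = a_scores @ A
    b_scores = B @ (s * np.sign(a_hat))
    b_scores[b_scores < 0] = 0
    b_hat = b_scores @ B

# Readout by nearest codevector; should recover indices 3 and 7.
print(int(np.argmax(A @ np.sign(a_hat))), int(np.argmax(B @ np.sign(b_hat))))
```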
arXiv Detail & Related papers (2023-03-24T12:31:48Z)
- Are Random Decompositions all we need in High Dimensional Bayesian Optimisation? [4.31133019086586]
We find that data-driven learners of decompositions can be easily misled towards local decompositions that do not hold globally.
A random tree-based decomposition sampler exhibits favourable theoretical guarantees that effectively trade off maximal information gain and functional mismatch between the actual black-box and its surrogate.
Those results motivate the development of the random decomposition upper-confidence bound algorithm (RDUCB), which is straightforward to implement and (almost) plug-and-play.
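The core idea of a random tree-based decomposition sampler can be sketched in a few lines: draw a random tree over the input dimensions and treat its edges as additive two-dimensional components of the surrogate. The sampler below is an illustrative stand-in, not RDUCB's full algorithm.

```python
# Sample a random tree over input dimensions; each edge becomes an additive
# 2-D component of the surrogate. Illustrative stand-in for RDUCB's sampler.
import random

def sample_tree_decomposition(n_dims, seed=0):
    """Return edges (i, j) of a random tree over dimensions 0..n_dims-1."""
    rng = random.Random(seed)
    dims = list(range(n_dims))
    rng.shuffle(dims)
    edges = []
    for i in range(1, n_dims):
        parent = dims[rng.randrange(i)]   # attach to a random earlier node
        edges.append((parent, dims[i]))
    return edges

print(sample_tree_decomposition(6))  # five edges forming a tree over 6 dims
```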
arXiv Detail & Related papers (2023-01-30T12:54:46Z)
- Deep ensembles based on Stochastic Activation Selection for Polyp Segmentation [82.61182037130406]
This work deals with medical image segmentation, in particular accurate polyp detection and segmentation during colonoscopy examinations.
The basic architecture for image segmentation consists of an encoder and a decoder.
We compare variants of the DeepLab architecture obtained by varying the decoder backbone.
arXiv Detail & Related papers (2021-04-02T02:07:37Z)
- Augmentation of the Reconstruction Performance of Fuzzy C-Means with an Optimized Fuzzification Factor Vector [99.19847674810079]
Fuzzy C-Means (FCM) is one of the most frequently used methods to construct information granules.
In this paper, we augment the FCM-based degranulation mechanism by introducing a vector of fuzzification factors.
Experiments completed for both synthetic and publicly available datasets show that the proposed approach outperforms the generic data reconstruction approach.
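For reference, the degranulation (reconstruction) step that the fuzzification factor vector enters is sketched below. It generalizes the standard single-factor FCM reconstruction formula by giving each cluster its own factor; the optimization of the factors themselves is omitted, and the array shapes are assumptions.

```python
# FCM degranulation with a per-cluster fuzzification factor vector m:
#   x_hat_k = sum_i u_ik^{m_i} v_i / sum_i u_ik^{m_i}
# Generalizes the standard single-m reconstruction; factor tuning omitted.
import numpy as np

def reconstruct(u, v, m):
    """u: (c, n) memberships, v: (c, d) prototypes, m: (c,) factors."""
    w = u ** m[:, None]                       # per-cluster fuzzified weights
    return (w.T @ v) / w.sum(axis=0)[:, None]

# Toy usage: 2 clusters, 3 data points in 2-D.
u = np.array([[0.8, 0.3, 0.5],
              [0.2, 0.7, 0.5]])
v = np.array([[0.0, 0.0],
              [1.0, 1.0]])
print(reconstruct(u, v, m=np.array([2.0, 2.5])))
```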
arXiv Detail & Related papers (2020-04-13T04:17:30Z)
- Kullback-Leibler Divergence-Based Fuzzy $C$-Means Clustering Incorporating Morphological Reconstruction and Wavelet Frames for Image Segmentation [152.609322951917]
We come up with a Kullback-Leibler (KL) divergence-based Fuzzy C-Means (FCM) algorithm by incorporating a tight wavelet frame transform and a morphological reconstruction operation.
The proposed algorithm achieves better segmentation performance than other comparative algorithms.
arXiv Detail & Related papers (2020-02-21T05:19:10Z)