Related papers: BinCoFer: Three-Stage Purification for Effective C/C++ Binary Third-Party Library Detection

BinCoFer: Three-Stage Purification for Effective C/C++ Binary Third-Party Library Detection

URL: http://arxiv.org/abs/2504.19551v1
Date: Mon, 28 Apr 2025 07:57:42 GMT
Title: BinCoFer: Three-Stage Purification for Effective C/C++ Binary Third-Party Library Detection
Authors: Yayi Zou, Yixiang Zhang, Guanghao Zhao, Yueming Wu, Shuhao Shen, Cai Fu,
Abstract summary: Third-party libraries (TPL) are becoming increasingly popular to achieve efficient and concise software development.<n> unregulated use of TPL will introduce legal and security issues in software development.<n>BinCoFer is a tool designed for detecting TPLs reused in binary programs.
Score: 3.406168883492101
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Third-party libraries (TPL) are becoming increasingly popular to achieve efficient and concise software development. However, unregulated use of TPL will introduce legal and security issues in software development. Consequently, some studies have attempted to detect the reuse of TPLs in target programs by constructing a feature repository. Most of the works require access to the source code of TPLs, while the others suffer from redundancy in the repository, low detection efficiency, and difficulties in detecting partially referenced third-party libraries. Therefore, we introduce BinCoFer, a tool designed for detecting TPLs reused in binary programs. We leverage the work of binary code similarity detection(BCSD) to extract binary-format TPL features, making it suitable for scenarios where the source code of TPLs is inaccessible. BinCoFer employs a novel three-stage purification strategy to mitigate feature repository redundancy by highlighting core functions and extracting function-level features, making it applicable to scenarios of partial reuse of TPLs. We have observed that directly using similarity threshold to determine the reuse between two binary functions is inaccurate, a problem that previous work has not addressed. Thus we design a method that uses weight to aggregate the similarity between functions in the target binary and core functions to ultimately judge the reuse situation with high frequency. To examine the ability of BinCoFer, we compiled a dataset on ArchLinux and conduct comparative experiments on it with other four most related works (i.e., ModX, B2SFinder, LibAM and BinaryAI)...

Related papers

An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios. Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z)
ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
Bi-Directional Transformers vs. word2vec: Discovering Vulnerabilities in Lifted Compiled Code [4.956066467858057]
This research explores vulnerability detection using natural language processing (NLP) embedding techniques with word2vec, BERT, and RoBERTa. Long short-term memory (LSTM) neural networks were trained on embeddings from encoders created using approximately 48k LLVM functions from the Juliet dataset.
arXiv Detail & Related papers (2024-05-31T03:57:19Z)
FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [51.898805184427545]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.<n>We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language.<n>We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z)
ReGAL: Refactoring Programs to Discover Generalizable Abstractions [59.05769810380928]
Generalizable Abstraction Learning (ReGAL) is a method for learning a library of reusable functions via codeization. We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains. For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains.
arXiv Detail & Related papers (2024-01-29T18:45:30Z)
BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching [8.655595404611821]
We introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features. Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA task.
arXiv Detail & Related papers (2024-01-20T07:57:57Z)
Cross-Inlining Binary Function Similarity Detection [16.923959153965857]
We propose a pattern-based model named CI-Detector for cross-inlining matching. Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.
arXiv Detail & Related papers (2024-01-11T08:42:08Z)
CP-BCS: Binary Code Summarization Guided by Control Flow Graph and Pseudo Code [79.87518649544405]
We present a control flow graph and pseudo code guided binary code summarization framework called CP-BCS. CP-BCS utilizes a bidirectional instruction-level control flow graph and pseudo code that incorporates expert knowledge to learn the comprehensive binary function execution behavior and logic semantics.
arXiv Detail & Related papers (2023-10-24T14:20:39Z)
LibAM: An Area Matching Framework for Detecting Third-party Libraries in Binaries [28.877355564114904]
Third-party libraries (TPLs) are utilized by developers to expedite the software development process and incorporate external functionalities. Insecure TPL reuse can lead to significant security risks. We introduce LibAM, a novel Area Matching framework that connects isolated functions into function areas on Function Call Graph.
arXiv Detail & Related papers (2023-05-06T12:26:56Z)
Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with binarization function. We present new designs of binary neural modules, which enables leading binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z)
Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding. However, the current training pipeline for dense detectors is compromised to lots of conjunctions that may not hold. We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
On using distributed representations of source code for the detection of C security vulnerabilities [14.8831988481175]
This paper presents an evaluation of the code representation model Code2vec when trained on the task of detecting security vulnerabilities in C source code. We leverage the open-source library astminer to extract path-contexts from the abstract syntax trees of a corpus of labeled C functions. Code2vec is trained on the resulting path-contexts with the task of classifying a function as vulnerable or non-vulnerable.
arXiv Detail & Related papers (2021-06-01T21:18:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.