Cross-modal Retrieval Models for Stripped Binary Analysis
- URL: http://arxiv.org/abs/2512.10393v1
- Date: Thu, 11 Dec 2025 07:58:10 GMT
- Title: Cross-modal Retrieval Models for Stripped Binary Analysis
- Authors: Guoqiang Chen, Lingyun Ying, Ziyang Song, Daguang Liu, Qiang Wang, Zhiqi Wang, Li Hu, Shaoyin Cheng, Weiming Zhang, Nenghai Yu,
- Abstract summary: BinSeek is the first two-stage cross-modal retrieval framework for stripped binary code analysis.<n>It consists of two models: BinSeekEmbedding is trained on large-scale dataset to learn the semantic relevance of the binary code.<n>BinSeek-Reranker learns to carefully judge the relevance of the candidate code to the description with context augmentation.
- Score: 62.89251403093734
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: LLM-agent based binary code analysis has demonstrated significant potential across a wide range of software security scenarios, including vulnerability detection, malware analysis, etc. In agent workflow, however, retrieving the positive from thousands of stripped binary functions based on user query remains under-studied and challenging, as the absence of symbolic information distinguishes it from source code retrieval. In this paper, we introduce, BinSeek, the first two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeekEmbedding is trained on large-scale dataset to learn the semantic relevance of the binary code and the natural language description, furthermore, BinSeek-Reranker learns to carefully judge the relevance of the candidate code to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training construction, also deriving a domain benchmark for future research. Our evaluation results show that BinSeek achieved the state-of-the-art performance, surpassing the the same scale models by 31.42% in Rec@3 and 27.17% in MRR@3, as well as leading the advanced general-purpose models that have 16 times larger parameters.
Related papers
- Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4410890572479]
We introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification.<n>LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains.<n>LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples.
arXiv Detail & Related papers (2025-09-03T06:42:40Z) - ORCAS: Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph [11.990392428275179]
We develop ORCAS, an Obfuscation-Resilient BCSA model based on Dominance Enhanced Semantic Graph (DESG)<n>DESG is an original binary code representation, capturing more binaries' implicit semantics without control flow structure.<n> ORCAS outperforms the state-of-the-art approaches over this newly released real-world vulnerability dataset by up to a recall improvement of 43%.
arXiv Detail & Related papers (2025-06-06T15:26:53Z) - BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models [50.17907898478795]
We introduce BinMetric, a benchmark designed to evaluate the performance of large language models on binary analysis tasks.<n>BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks.<n>Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field.
arXiv Detail & Related papers (2025-05-12T08:54:07Z) - An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding [50.17907898478795]
This work proposes a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in real-world reverse engineering scenarios.<n>Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2025-04-30T17:02:06Z) - Binary Code Similarity Detection via Graph Contrastive Learning on Intermediate Representations [52.34030226129628]
Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification.
In this paper, we propose IRBinDiff, which mitigates compilation differences by leveraging LLVM-IR with higher-level semantic abstraction.
Our extensive experiments, conducted under varied compilation settings, demonstrate that IRBinDiff outperforms other leading BCSD methods in both One-to-one comparison and One-to-many search scenarios.
arXiv Detail & Related papers (2024-10-24T09:09:20Z) - BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis [6.093226756571566]
We construct a benchmark dataset for fine-grained binary code similarity analysis called BinSimDB.
Specifically, we propose BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets.
The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
arXiv Detail & Related papers (2024-10-14T05:13:48Z) - Unsupervised Binary Code Translation with Application to Code Similarity Detection and Vulnerability Discovery [2.022692275087205]
Cross-architecture binary code analysis has become an emerging problem.
Deep learning-based binary analysis has shown promising success.
For some low-resource ISAs, an adequate amount of data is hard to find.
arXiv Detail & Related papers (2024-04-29T18:09:28Z) - How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z) - FASER: Binary Code Similarity Search through the use of Intermediate
Representations [0.8594140167290099]
Cross-Architecture Binary Code Similarity Search has been explored in numerous studies.
We propose Function as a String Encoded Representation (FASER) to create a model capable of cross architecture function search.
arXiv Detail & Related papers (2023-10-05T15:36:35Z) - UniASM: Binary Code Similarity Detection without Fine-tuning [2.2329530239800035]
We propose a novel rich-semantic function representation technique to ensure the model captures the intricate nuances of binary code.<n>We introduce the first UniLM-based binary code embedding model, named UniASM, which includes two newly designed training tasks.<n>The experimental results show that UniASM outperforms the state-of-the-art (SOTA) approaches on the evaluation datasets.
arXiv Detail & Related papers (2022-10-28T14:04:57Z) - Semantic-aware Binary Code Representation with BERT [27.908093567605484]
A wide range of binary analysis applications, such as bug discovery, malware analysis and code clone detection, require recovery of contextual meanings on a binary code.
Recently, binary analysis techniques based on machine learning have been proposed to automatically reconstruct the code representation of a binary.
In this paper, we propose DeepSemantic utilizing BERT in producing the semantic-aware code representation of a binary code.
arXiv Detail & Related papers (2021-06-10T03:31:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.