DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
- URL: http://arxiv.org/abs/2505.11340v1
- Date: Fri, 16 May 2025 15:07:43 GMT
- Title: DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
- Authors: Zeyu Gao, Yuxin Cui, Hao Wang, Siliang Qin, Yuanda Wang, Bolun Zhang, Chao Zhang
- Abstract summary: Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings. We present DecompileBench, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering.
- Score: 9.284467500179922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decompilers are fundamental tools for critical security tasks, from vulnerability discovery to malware analysis, yet their evaluation remains fragmented. Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings, failing to address real-world requirements for semantic fidelity and analyst usability. We present DecompileBench, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering workflows through three key components: real-world function extraction (comprising 23,400 functions from 130 real-world programs), runtime-aware validation, and automated human-centric assessment using LLM-as-Judge to quantify the effectiveness of decompilers in reverse engineering workflows. Through a systematic comparison between six industrial-strength decompilers and six recent LLM-powered approaches, we demonstrate that LLM-based methods surpass commercial tools in code understandability despite 52.2% lower functionality correctness. These findings highlight the potential of LLM-based approaches to transform human-centric reverse engineering. We open source DecompileBench (https://github.com/Jennieett/DecompileBench) to provide a framework to advance research on decompilers and assist security experts in making informed tool selections based on their specific requirements.
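The abstract names the framework's stages without detailing them; the sketch below is a minimal, hypothetical rendering of two of them, runtime-aware validation and LLM-as-Judge assessment, assuming a harness that recompiles each decompiled function, replays recorded inputs against the original binary, and asks a judge model for an understandability score. Every helper name (recompile, matches_reference, judge_prompt) and the rubric wording are illustrative assumptions, not DecompileBench's actual implementation.

```python
import subprocess
import tempfile
from pathlib import Path

# Illustrative only: none of these helpers correspond to DecompileBench's real tooling.

def recompile(decompiled_c: str, harness_c: str) -> Path | None:
    """Compile a decompiled function together with a test harness; None if compilation fails."""
    workdir = Path(tempfile.mkdtemp())
    src = workdir / "candidate.c"
    src.write_text(harness_c + "\n" + decompiled_c)
    binary = workdir / "candidate"
    result = subprocess.run(["cc", "-O0", "-o", str(binary), str(src)],
                            capture_output=True)
    return binary if result.returncode == 0 else None

def matches_reference(candidate: Path, reference: Path, inputs: list[bytes]) -> bool:
    """Runtime-aware check: identical exit code and stdout on every recorded input."""
    for data in inputs:
        cand = subprocess.run([str(candidate)], input=data,
                              capture_output=True, timeout=5)
        ref = subprocess.run([str(reference)], input=data,
                             capture_output=True, timeout=5)
        if (cand.returncode, cand.stdout) != (ref.returncode, ref.stdout):
            return False
    return True

def judge_prompt(original_c: str, decompiled_c: str) -> str:
    """Build a rubric-style prompt asking an LLM judge to rate understandability (1-5)."""
    return (
        "Rate how easy the decompiled function below is to understand relative to "
        "the original source, on a 1-5 scale. Consider variable naming, control-flow "
        "structure, and type recovery. Answer with a single integer.\n\n"
        f"Original:\n{original_c}\n\nDecompiled:\n{decompiled_c}\n"
    )
```

Under a harness of this shape, a function would count as functionally correct only if it both recompiles and matches the reference behaviour, while understandability is scored separately by the judge model; this mirrors the split between the correctness and understandability results the abstract reports.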
Related papers
- Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security [0.0]
Large language models (LLMs) increasingly integrate native code interpreters, enabling real-time execution capabilities. These integrations introduce potential system-level cybersecurity threats, fundamentally different from prompt-based vulnerabilities. We propose CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting CPU, memory, and disk resource exhaustion.
arXiv Detail & Related papers (2025-07-25T16:06:16Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models [11.809732662992982]
This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate the performance of Large Language Models (LLMs) in the Model Context Protocol (MCP) framework. Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains.
arXiv Detail & Related papers (2025-05-22T14:02:37Z) - OSS-Bench: Benchmark Generator for Coding LLMs [4.393587297483245]
We introduce OSS-Bench, a benchmark generator that constructs large-scale, live evaluation tasks from real-world open-source software. OSS-Bench replaces functions with LLM-generated code and evaluates them using three natural metrics: compilability, functional correctness, and memory safety. Our results demonstrate that OSS-Bench mitigates overfitting by leveraging the evolving complexity of OSS.
arXiv Detail & Related papers (2025-05-18T09:53:51Z) - CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System [52.048087777953064]
We propose CompileAgent, an agent framework dedicated to repo-level compilation. CompileAgent integrates five tools and a flow-based agent strategy, enabling interaction with software artifacts for compilation instruction search and error resolution. We show that our method significantly improves the compilation success rate, with gains ranging from 10% to 71%.
arXiv Detail & Related papers (2025-05-07T08:59:14Z) - Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machines.
arXiv Detail & Related papers (2025-04-29T13:52:47Z) - CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection [2.5228276786940182]
This paper introduces CASTLE, a benchmarking framework for evaluating the vulnerability detection capabilities of different methods. We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs.
arXiv Detail & Related papers (2025-03-12T14:30:05Z) - ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z) - LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient [19.673388630963807]
We propose an automated and unbiased evaluation framework, structured around four dimensions and ten criteria. Under this framework, we analyze the advantages and weaknesses of directly prompting large language models (LLMs) as generic benchmark generators. We then introduce a series of methods to address the identified weaknesses and integrate them as BenchMaker. Experiments confirm that BenchMaker achieves superior or comparable performance to human-annotated benchmarks on all metrics.
arXiv Detail & Related papers (2025-02-02T06:36:01Z) - SCoPE: Evaluating LLMs for Software Vulnerability Detection [0.0]
This work explores and refines the CVEFixes dataset, which is commonly used to train models for code-related tasks.
The output generated by SCoPE was used to create a new version of CVEFixes.
The results show that SCoPE helped identify 905 duplicates within the evaluated subset.
arXiv Detail & Related papers (2024-07-19T15:02:00Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets.
Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets.
We find that advanced prompting strategies involving step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average).
arXiv Detail & Related papers (2023-09-19T15:25:42Z) - MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [78.60644407028022]
We introduce MINT, a benchmark that evaluates large language models' ability to solve tasks with multi-turn interactions.
LLMs generally benefit from tools and language feedback, with performance gains of 1-8% for each turn of tool use.
Among the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
arXiv Detail & Related papers (2023-09-19T15:25:42Z)