DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
- URL: http://arxiv.org/abs/2505.11340v1
- Date: Fri, 16 May 2025 15:07:43 GMT
- Title: DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
- Authors: Zeyu Gao, Yuxin Cui, Hao Wang, Siliang Qin, Yuanda Wang, Bolun Zhang, Chao Zhang
- Abstract summary: Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings. We present DecompileBench, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering.
- Score: 9.284467500179922
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decompilers are fundamental tools for critical security tasks, from vulnerability discovery to malware analysis, yet their evaluation remains fragmented. Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings, failing to address real-world requirements for semantic fidelity and analyst usability. We present DecompileBench, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering workflows through three key components: real-world function extraction (comprising 23,400 functions from 130 real-world programs), runtime-aware validation, and automated human-centric assessment using LLM-as-Judge to quantify the effectiveness of decompilers in reverse engineering workflows. Through a systematic comparison between six industrial-strength decompilers and six recent LLM-powered approaches, we demonstrate that LLM-based methods surpass commercial tools in code understandability despite 52.2% lower functionality correctness. These findings highlight the potential of LLM-based approaches to transform human-centric reverse engineering. We open source DecompileBench (https://github.com/Jennieett/DecompileBench) to provide a framework to advance research on decompilers and assist security experts in making informed tool selections based on their specific requirements.
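The abstract names the framework's stages without detailing them; the sketch below is a minimal, hypothetical rendering of two of them, runtime-aware validation and LLM-as-Judge assessment, assuming a harness that recompiles each decompiled function, replays recorded inputs against the original binary, and asks a judge model for an understandability score. Every helper name (recompile, matches_reference, judge_prompt) and the rubric wording are illustrative assumptions, not DecompileBench's actual implementation.

```python
import subprocess
import tempfile
from pathlib import Path

# Illustrative only: none of these helpers correspond to DecompileBench's real tooling.

def recompile(decompiled_c: str, harness_c: str) -> Path | None:
    """Compile a decompiled function together with a test harness; None if compilation fails."""
    workdir = Path(tempfile.mkdtemp())
    src = workdir / "candidate.c"
    src.write_text(harness_c + "\n" + decompiled_c)
    binary = workdir / "candidate"
    result = subprocess.run(["cc", "-O0", "-o", str(binary), str(src)],
                            capture_output=True)
    return binary if result.returncode == 0 else None

def matches_reference(candidate: Path, reference: Path, inputs: list[bytes]) -> bool:
    """Runtime-aware check: identical exit code and stdout on every recorded input."""
    for data in inputs:
        cand = subprocess.run([str(candidate)], input=data,
                              capture_output=True, timeout=5)
        ref = subprocess.run([str(reference)], input=data,
                             capture_output=True, timeout=5)
        if (cand.returncode, cand.stdout) != (ref.returncode, ref.stdout):
            return False
    return True

def judge_prompt(original_c: str, decompiled_c: str) -> str:
    """Build a rubric-style prompt asking an LLM judge to rate understandability (1-5)."""
    return (
        "Rate how easy the decompiled function below is to understand relative to "
        "the original source, on a 1-5 scale. Consider variable naming, control-flow "
        "structure, and type recovery. Answer with a single integer.\n\n"
        f"Original:\n{original_c}\n\nDecompiled:\n{decompiled_c}\n"
    )
```

Under a harness of this shape, a function would count as functionally correct only if it both recompiles and matches the reference behaviour, while understandability is scored separately by the judge model; this mirrors the split between the correctness and understandability results the abstract reports.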
Related papers
- Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security [0.0]
Large language models (LLMs) increasingly integrate native code interpreters, enabling real-time execution capabilities. These integrations introduce potential system-level cybersecurity threats, fundamentally different from prompt-based vulnerabilities. We propose CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting CPU, memory, and disk resource exhaustion.
arXiv Detail & Related papers (2025-07-25T16:06:16Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models [11.809732662992982]
This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate the performance of Large Language Models (LLMs) in the Model Context Protocol (MCP) framework. Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains.
arXiv Detail & Related papers (2025-05-22T14:02:37Z) - OSS-Bench: Benchmark Generator for Coding LLMs [4.393587297483245]
We introduce OSS-Bench, a benchmark generator that constructs large-scale, live evaluation tasks from real-world open-source software. OSS-Bench replaces functions with LLM-generated code and evaluates them using three natural metrics: compilability, functional correctness, and memory safety. Our results demonstrate that OSS-Bench mitigates overfitting by leveraging the evolving complexity of OSS.
arXiv Detail & Related papers (2025-05-18T09:53:51Z) - CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System [52.048087777953064]
We propose CompileAgent, an agent framework dedicated to repo-level compilation. CompileAgent integrates five tools and a flow-based agent strategy, enabling interaction with software artifacts for compilation instruction search and error resolution. We show that our method significantly improves the compilation success rate, with gains ranging from 10% to 71%.
arXiv Detail & Related papers (2025-05-07T08:59:14Z) - Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machines.
arXiv Detail & Related papers (2025-04-29T13:52:47Z) - CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection [2.5228276786940182]
This paper introduces CASTLE, a benchmarking framework for evaluating the vulnerability detection capabilities of different methods. We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs.
arXiv Detail & Related papers (2025-03-12T14:30:05Z) - ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z) - LLM-Powered Benchmark Factory: Reliable, Generic, and Efficient [19.673388630963807]
We propose an automated and unbiased evaluation framework, structured around four dimensions and ten criteria. Under this framework, we analyze the advantages and weaknesses of directly prompting large language models (LLMs) as generic benchmark generators. We then introduce a series of methods to address the identified weaknesses and integrate them as BenchMaker. Experiments confirm that BenchMaker achieves superior or comparable performance to human-annotated benchmarks on all metrics.
arXiv Detail & Related papers (2025-02-02T06:36:01Z) - SCoPE: Evaluating LLMs for Software Vulnerability Detection [0.0]
This work explores and refines the CVEFixes dataset, which is commonly used to train models for code-related tasks.
The output generated by SCoPE was used to create a new version of CVEFixes.
The results show that SCoPE helped identify 905 duplicates within the evaluated subset.
arXiv Detail & Related papers (2024-07-19T15:02:00Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets.
Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets.
We find that advanced prompting strategies involving step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average).
arXiv Detail & Related papers (2023-09-19T15:25:42Z) - MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [78.60644407028022]
We introduce MINT, a benchmark that evaluates large language models' ability to solve tasks with multi-turn interactions.
LLMs generally benefit from tools and language feedback, with performance gains of 1-8% for each turn of tool use.
Among the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
arXiv Detail & Related papers (2023-09-19T15:25:42Z)