Related papers: Can LLMs Recover Program Semantics? A Systematic Evaluation with Symbolic Execution

URL: http://arxiv.org/abs/2511.19130v1
Date: Mon, 24 Nov 2025 13:55:20 GMT
Title: Can LLMs Recover Program Semantics? A Systematic Evaluation with Symbolic Execution
Authors: Rong Feng, Suman Saha
Abstract summary: Obfuscation poses a persistent challenge for software engineering tasks such as program comprehension, maintenance, testing, and vulnerability detection. We investigate whether fine-tuned language models can effectively deobfuscate programs and restore analyzability.
Score: 1.5377279217726239
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Obfuscation poses a persistent challenge for software engineering tasks such as program comprehension, maintenance, testing, and vulnerability detection. While compiler optimizations and third-party code often introduce transformations that obscure program intent, existing analysis tools and large language models (LLMs) struggle to recover the original semantics. In this work, we investigate whether LLMs, when fine-tuned with symbolic execution artifacts, can effectively deobfuscate programs and restore analyzability. We construct a benchmark by applying four widely studied transformations (control-flow flattening, opaque predicates, arithmetic encoding, and branch encoding) across diverse C programs from TUM Obfuscation Benchmarks, the LLVM test suite, and algorithmic repositories. We then compare three state-of-the-art LLMs under two training configurations: baseline fine-tuning on obfuscated/original code pairs, and enhanced fine-tuning with additional KLEE artifacts such as SMT constraints, path statistics, and test cases. Our evaluation examines syntactic correctness (compilation success), semantic fidelity (behavioral equivalence under symbolic execution), and code quality (readability and structure). Results show that GPT-4.1-mini achieves the strongest deobfuscation overall, and that incorporating KLEE artifacts consistently improves semantic preservation and compilation success across models. These findings highlight deobfuscation as a broader software engineering concern, demonstrating that combining LLMs with symbolic execution can strengthen automated testing, static analysis, and program comprehension in the presence of obfuscation.
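
Illustrative sketch (not from the paper): the abstract names opaque predicates and arithmetic encoding among its transformations. The Python snippet below shows what those two transformations can look like on a toy addition function, and checks agreement on a few concrete inputs. The paper's benchmark is in C and its equivalence checks use symbolic execution with KLEE; the function names, the 32-bit mask, and the sample inputs here are all assumptions made for illustration, and the concrete-input check is a much weaker stand-in for symbolic equivalence.

```python
MASK = 0xFFFFFFFF  # assumption: emulate 32-bit unsigned wraparound, as with C's uint32_t


def add_original(x: int, y: int) -> int:
    """Reference function: plain 32-bit addition."""
    return (x + y) & MASK


def add_obfuscated(x: int, y: int) -> int:
    """Same function after two illustrative transformations."""
    # Opaque predicate: x * (x + 1) is a product of consecutive integers,
    # so it is always even and this branch is always taken.
    if (x * (x + 1)) % 2 == 0:
        # Arithmetic encoding: x + y == (x ^ y) + 2 * (x & y).
        return ((x ^ y) + 2 * (x & y)) & MASK
    # Dead code that a deobfuscator would ideally recognize and remove.
    return (x - y) & MASK


def agree_on(samples) -> bool:
    """Differential check on concrete inputs (weaker than symbolic execution)."""
    return all(add_original(x, y) == add_obfuscated(x, y) for x, y in samples)


# Hypothetical sample inputs, including one that wraps around at 32 bits.
samples = [(0, 0), (1, 2), (7, 9), (MASK, 1), (123456789, 987654321)]
print(agree_on(samples))
```

A successful deobfuscation of `add_obfuscated` in this toy setting would recover something close to `add_original`: the always-true branch collapses and the XOR/AND expression simplifies back to a single addition.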

Related papers

Context-Guided Decompilation: A Step Towards Re-executability [50.71992919223209]
Binary decompilation plays an important role in software security analysis, reverse engineering and malware understanding. Recent advances in large language models (LLMs) have enabled neural decompilation, but the generated code is typically only semantically plausible. We propose ICL4Decomp, a hybrid decompilation framework that leverages in-context learning (ICL) to guide LLMs toward generating re-executable source code.
arXiv Detail & Related papers (2025-11-03T17:21:39Z)
On Code-Induced Reasoning in LLMs [21.875805779552564]
We construct parallel instruction datasets in ten programming languages. We apply controlled perturbations that selectively disrupt structural or semantic properties of code. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones.
arXiv Detail & Related papers (2025-09-25T19:57:36Z)
"Digital Camouflage": The LLVM Challenge in LLM-Based Malware Detection [0.0]
Large Language Models (LLMs) have emerged as promising tools for malware detection. However, their reliability under adversarial compiler-level obfuscation is yet to be discovered. This study empirically evaluates the robustness of three state-of-the-art LLMs against compiler-level obfuscation techniques.
arXiv Detail & Related papers (2025-09-20T12:47:36Z)
Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities [0.49157446832511503]
Large language models (LLMs) have shown promise in software engineering, yet their effectiveness for binary analysis remains unexplored. We present the first comprehensive evaluation of commercial LLMs for assembly code deobfuscation.
arXiv Detail & Related papers (2025-05-26T12:16:44Z)
Program Semantic Inequivalence Game with Large Language Models [20.43560028315856]
Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game SInQ. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources.
arXiv Detail & Related papers (2025-05-02T20:03:35Z)
The Code Barrier: What LLMs Actually Understand? [7.407441962359689]
This research uses code obfuscation as a structured testing framework to evaluate semantic understanding capabilities of language models. Findings show a statistically significant performance decline as obfuscation complexity increases. This research introduces a new evaluation approach for assessing code comprehension in language models.
arXiv Detail & Related papers (2025-04-14T14:11:26Z)
ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding [60.37988508851391]
Language models (LMs) have become a staple of the code-writing toolbox. Research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency.
arXiv Detail & Related papers (2025-03-27T23:08:53Z)
EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs). This task directly tests a model's ability to reason about program semantics. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
Can Large Language Models Understand Symbolic Graphics Programs? [136.5639211254501]
Symbolic graphics programs are popular in computer graphics. They allow us to test an LLM's ability to answer semantic questions about the images or 3D geometries without a vision encoder. We create a benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about visual output of programs.
arXiv Detail & Related papers (2024-08-15T17:59:57Z)
Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.