Demystifying and Assessing Code Understandability in Java Decompilation
- URL: http://arxiv.org/abs/2409.20343v1
- Date: Mon, 30 Sep 2024 14:44:00 GMT
- Title: Demystifying and Assessing Code Understandability in Java Decompilation
- Authors: Ruixin Qin, Yifan Xiong, Yifei Lu, Minxue Pan
- Abstract summary: Decompilation, the process of converting machine-level code into readable source code, plays a critical role in reverse engineering.
We present the first empirical study on the understandability of Java decompiled code.
- Score: 3.2671789531342457
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decompilation, the process of converting machine-level code into readable source code, plays a critical role in reverse engineering. Since the main purpose of decompilation is to facilitate code comprehension when source code is unavailable, the understandability of decompiled code is of great importance. In this paper, we present the first empirical study on the understandability of Java decompiled code and report the following findings: (1) The understandability of Java decompilation is considered as important as its correctness, and understandability issues are encountered even more often than decompilation failures. (2) A notable percentage of code snippets decompiled by Java decompilers exhibit significantly lower or higher understandability than their original source code. (3) Cognitive Complexity achieves acceptable precision but low recall in recognizing code snippets whose understandability diverges during decompilation. (4) Perplexity achieves even lower precision and recall in recognizing such snippets. Motivated by these four findings, we further propose six code patterns and the first metric for assessing decompiled code understandability. The metric extends Cognitive Complexity with six additional rules harvested from an exhaustive manual analysis of 1,287 pairs of source code snippets and their decompiled counterparts. Validated on the original and an updated dataset, the metric achieves a macro F1-score of 0.88 on the original dataset and 0.86 on the test set.
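The abstract does not enumerate the six patterns or the exact extension rules, but the general shape of a Cognitive-Complexity-style metric augmented with decompilation-specific penalties can be sketched. Below is a minimal, hypothetical Java sketch: the base increments for branching and nesting follow the standard Cognitive Complexity idea, while the two penalty patterns (synthetic variable names, redundant casts) are illustrative stand-ins, not the paper's actual rules.

```java
import java.util.List;
import java.util.regex.Pattern;

/**
 * Illustrative sketch of an understandability metric in the spirit of the
 * paper's Cognitive Complexity extension. The paper's six decompilation
 * patterns are not listed in the abstract, so the two penalty rules below
 * are hypothetical stand-ins, not the authors' actual rules.
 */
public class DecompiledUnderstandability {

    // Hypothetical pattern: decompiler-generated identifiers such as var1, var2.
    private static final Pattern SYNTHETIC_NAME = Pattern.compile("\\bvar\\d+\\b");
    // Hypothetical pattern: redundant casts that decompilers often emit.
    private static final Pattern REDUNDANT_CAST =
            Pattern.compile("\\(\\s*(int|long|Object)\\s*\\)\\s*\\(");

    public static int score(List<String> lines) {
        int score = 0;
        int nesting = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            // Base rule (as in Cognitive Complexity): each branch or loop
            // costs 1, plus 1 for every level of nesting it sits in.
            if (trimmed.matches("^(if|for|while|switch|catch)\\b.*")) {
                score += 1 + nesting;
            }
            // Track nesting depth from braces (a real tool would use an AST).
            for (char c : trimmed.toCharArray()) {
                if (c == '{') nesting++;
                if (c == '}') nesting = Math.max(0, nesting - 1);
            }
            // Hypothetical decompilation-specific penalties.
            if (SYNTHETIC_NAME.matcher(trimmed).find()) score += 1;
            if (REDUNDANT_CAST.matcher(trimmed).find()) score += 1;
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> decompiled = List.of(
            "public int f(int var1) {",
            "    if (var1 > 0) {",
            "        for (int var2 = 0; var2 < var1; var2++) {",
            "            var1 += var2;",
            "        }",
            "    }",
            "    return var1;",
            "}");
        System.out.println("understandability penalty = " + score(decompiled));
    }
}
```

A real implementation would operate on an AST rather than raw lines; the regex-based scan above only illustrates how pattern rules compose with the base score.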
Related papers
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- WaDec: Decompiling WebAssembly Using Large Language Model [5.667013605202579]
WebAssembly (abbreviated Wasm) has emerged as a cornerstone of web development.
Despite its advantages, Wasm's binary nature presents significant challenges for developers and researchers.
We introduce WaDec, the first approach to use a fine-tuned LLM to interpret and decompile Wasm binary code.
arXiv Detail & Related papers (2024-06-17T09:08:30Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
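The abstract describes the detector only at a high level; a minimal sketch of the rewrite-similarity idea might look as follows, with the LLM rewriter stubbed out and Jaccard token overlap standing in for whatever similarity function the paper actually uses (both are assumptions, as is the 0.8 threshold).

```java
import java.util.*;
import java.util.function.UnaryOperator;

/**
 * Minimal sketch of the rewrite-similarity idea: LLM-generated code tends to
 * change less when an LLM rewrites it than human-written code does. The
 * rewriter is stubbed; a real detector would call an actual LLM, and Jaccard
 * token overlap is an illustrative stand-in for the paper's similarity.
 */
public class RewriteSimilarityDetector {

    static double jaccard(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.split("\\W+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.split("\\W+")));
        Set<String> inter = new HashSet<>(ta);
        inter.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    /** Average similarity between the snippet and n LLM rewrites. */
    static double selfSimilarity(String code, UnaryOperator<String> llmRewrite, int n) {
        double total = 0;
        for (int i = 0; i < n; i++) total += jaccard(code, llmRewrite.apply(code));
        return total / n;
    }

    public static void main(String[] args) {
        // Stub standing in for an LLM "rewrite this code" call.
        UnaryOperator<String> rewriter = code -> code.replace("i", "idx");
        String snippet = "for (int i = 0; i < n; i++) sum += i;";
        double sim = selfSimilarity(snippet, rewriter, 4);
        // Hypothetical threshold; a real detector tunes this on validation data.
        System.out.println(sim > 0.8 ? "likely LLM-generated" : "likely human-written");
    }
}
```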
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large Language Models [37.8941430624661]
This study delves into the potential of large language models (LLMs) for binary code comprehension.
We present BinSum, a comprehensive benchmark and dataset of over 557K binary functions.
We also propose a new semantic similarity metric that surpasses traditional exact-match approaches.
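The abstract does not define the proposed metric, but the contrast with exact matching can be illustrated: a token-level cosine similarity credits paraphrased summaries that exact match rejects. Everything below (the similarity choice, the example strings) is illustrative, not BinSum's actual metric.

```java
import java.util.*;

/**
 * Illustrative contrast between exact-match scoring and a simple semantic
 * similarity for generated binary-code summaries. Token-level cosine
 * similarity is a hypothetical stand-in for BinSum's unspecified metric.
 */
public class SummaryScoring {

    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String tok : text.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty()) tf.merge(tok, 1, Integer::sum);
        }
        return tf;
    }

    static double cosine(String a, String b) {
        Map<String, Integer> ta = termFreq(a), tb = termFreq(b);
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            dot += e.getValue() * tb.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : tb.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        String reference = "copies n bytes from src to dst";
        String generated = "copies n bytes from the source buffer to the destination";
        // Exact match fails although the meaning is close...
        System.out.println("exact match: " + reference.equals(generated));
        // ...while token-level cosine similarity credits the overlap.
        System.out.println("cosine: " + cosine(reference, generated));
    }
}
```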
arXiv Detail & Related papers (2023-12-15T08:32:28Z)
- On the Relationship between Code Verifiability and Understandability [2.5728707125824735]
Proponents of software verification have argued that simpler code is easier to verify.
We compare the number of warnings produced by four state-of-the-art verification tools on 211 snippets of Java code with 20 metrics of code comprehensibility from human subjects.
arXiv Detail & Related papers (2023-10-31T03:54:35Z)
- Improving type information inferred by decompilers with supervised machine learning [0.0]
In software reverse engineering, decompilation is the process of recovering source code from binary files.
We build different classification models capable of inferring the high-level type returned by functions.
Our system is able to predict function return types with a 79.1% F1-measure, whereas the best decompiler obtains a 30% F1-measure.
arXiv Detail & Related papers (2021-01-19T11:45:46Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)