Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large
Language Models
- URL: http://arxiv.org/abs/2312.09601v1
- Date: Fri, 15 Dec 2023 08:32:28 GMT
- Title: Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large
Language Models
- Authors: Xin Jin, Jonathan Larson, Weiwei Yang, Zhiqiang Lin
- Abstract summary: This study delves into the potential of large language models (LLMs) for binary code comprehension.
We present BinSum, a comprehensive benchmark and dataset of over 557K binary functions.
We also propose a new semantic similarity metric that surpasses traditional exact-match approaches.
- Score: 37.8941430624661
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Binary code summarization, while invaluable for understanding code semantics,
is challenging due to its labor-intensive nature. This study delves into the
potential of large language models (LLMs) for binary code comprehension. To
this end, we present BinSum, a comprehensive benchmark and dataset of over 557K
binary functions and introduce a novel method for prompt synthesis and
optimization. To more accurately gauge LLM performance, we also propose a new
semantic similarity metric that surpasses traditional exact-match approaches.
Our extensive evaluation of prominent LLMs, including ChatGPT, GPT-4, Llama 2,
and Code Llama, reveals 10 pivotal insights. This evaluation generated 4
billion inference tokens and incurred a total cost of 11,418 US dollars and 873
NVIDIA A100 GPU hours. Our findings highlight both the transformative potential
of LLMs in this field and the challenges yet to be overcome.
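The abstract does not spell out how the proposed semantic similarity metric is computed; as a rough illustration, the sketch below shows one common embedding-based formulation and contrasts it with exact match. The embedding model and scoring details are assumptions for illustration, not the configuration used by BinSum.

```python
# Hedged sketch: embedding-based semantic similarity between a model-generated
# summary and a reference summary, contrasted with exact-match scoring.
# The concrete metric used by BinSum may differ; the model name below is an
# illustrative assumption, not the paper's configuration.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def exact_match(generated: str, reference: str) -> float:
    """Traditional exact-match scoring: 1.0 only when the strings are identical."""
    return float(generated.strip() == reference.strip())

def semantic_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings of the two summaries."""
    a, b = _model.encode([generated, reference])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

gen = "Copies a null-terminated string from src to dst and returns dst."
ref = "Duplicates the source string into the destination buffer."
print(exact_match(gen, ref))          # 0.0 -- surface forms differ
print(semantic_similarity(gen, ref))  # high score -- the meanings align
```

An embedding-based score of this kind rewards summaries that convey the same meaning in different words, which exact match cannot capture.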
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
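As a rough illustration of such a training-free repair loop (not the authors' implementation), the sketch below compiles a candidate snippet, feeds the compiler diagnostics back to an arbitrary model call, and asks for a corrected version; the llm_fn callable, the prompt wording, and the round limit are placeholders.

```python
# Hedged sketch of an iterative self-critique repair loop driven by compiler
# feedback. llm_fn stands in for any chat/completion call and is NOT the
# paper's implementation; the prompt and round limit are assumptions.
import os
import subprocess
import tempfile
from typing import Callable

def compile_feedback(code: str) -> str:
    """Syntax-check the snippet with gcc and return stderr (empty if it passes)."""
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["gcc", "-fsyntax-only", path],
                                capture_output=True, text=True)
        return result.stderr
    finally:
        os.unlink(path)

def self_critique_repair(code: str, llm_fn: Callable[[str], str],
                         max_rounds: int = 3) -> str:
    """Repeatedly ask the model to correct its code using compiler diagnostics."""
    for _ in range(max_rounds):
        feedback = compile_feedback(code)
        if not feedback:  # compiles cleanly; stop iterating
            break
        prompt = (f"The following C code fails to compile:\n{code}\n"
                  f"Compiler feedback:\n{feedback}\n"
                  "Please return a corrected version.")
        code = llm_fn(prompt)
    return code
```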
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
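A minimal sketch of this rewriting-based idea, assuming an arbitrary LLM rewrite call and a simple textual similarity measure: code that the model reproduces almost verbatim under rewriting is flagged as likely synthetic. The rewrite_fn callable, similarity measure, and threshold are illustrative assumptions, not the paper's method.

```python
# Hedged sketch of zero-shot detection via code rewriting: rewrite the
# candidate several times and score how closely the rewrites match the
# original. The intuition from the abstract is that LLM-generated code is
# rewritten more faithfully than human-written code. rewrite_fn and the
# threshold are placeholders, not the paper's setup.
import difflib
from typing import Callable

def rewriting_similarity(code: str,
                         rewrite_fn: Callable[[str], str],
                         n_rewrites: int = 4) -> float:
    """Average textual similarity between the code and its LLM rewrites."""
    scores = []
    for _ in range(n_rewrites):
        rewritten = rewrite_fn(code)  # e.g., prompt the model to "rewrite this code"
        scores.append(difflib.SequenceMatcher(None, code, rewritten).ratio())
    return sum(scores) / len(scores)

def looks_synthetic(code: str,
                    rewrite_fn: Callable[[str], str],
                    threshold: float = 0.8) -> bool:
    """Flag code as likely LLM-generated when rewrites stay close to the original."""
    return rewriting_similarity(code, rewrite_fn) >= threshold
```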
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages.
We conduct a systematic evaluation of over 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z)
- Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [5.915447908295047]
We present a large-scale empirical study to investigate the ability of general LLMs and code LLMs for code translation.
Our study involves the translation of 1,700 code samples from three benchmarks and two real-world projects.
We find that correct translations range from 2.1% to 47.3% for the studied LLMs.
arXiv Detail & Related papers (2023-08-06T13:33:13Z)
- Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions [11.327913840111378]
We introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes.
We empirically evaluate the performance of several state-of-the-art LLMs for this task.
arXiv Detail & Related papers (2023-04-07T18:58:33Z)