Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large
Language Models
- URL: http://arxiv.org/abs/2312.09601v1
- Date: Fri, 15 Dec 2023 08:32:28 GMT
- Title: Binary Code Summarization: Benchmarking ChatGPT/GPT-4 and Other Large
Language Models
- Authors: Xin Jin, Jonathan Larson, Weiwei Yang, Zhiqiang Lin
- Abstract summary: This study delves into the potential of large language models (LLMs) for binary code comprehension.
We present BinSum, a comprehensive benchmark and dataset of over 557K binary functions.
We also propose a new semantic similarity metric that surpasses traditional exact-match approaches.
- Score: 37.8941430624661
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Binary code summarization, while invaluable for understanding code semantics,
is challenging due to its labor-intensive nature. This study delves into the
potential of large language models (LLMs) for binary code comprehension. To
this end, we present BinSum, a comprehensive benchmark and dataset of over 557K
binary functions and introduce a novel method for prompt synthesis and
optimization. To more accurately gauge LLM performance, we also propose a new
semantic similarity metric that surpasses traditional exact-match approaches.
Our extensive evaluation of prominent LLMs, including ChatGPT, GPT-4, Llama 2,
and Code Llama, reveals 10 pivotal insights. This evaluation generated 4
billion inference tokens and incurred a total cost of 11,418 US dollars and 873
NVIDIA A100 GPU hours. Our findings highlight both the transformative potential
of LLMs in this field and the challenges yet to be overcome.
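The abstract does not spell out how the proposed semantic similarity metric is computed; as a rough illustration, the sketch below shows one common embedding-based formulation and contrasts it with exact match. The embedding model and scoring details are assumptions for illustration, not the configuration used by BinSum.

```python
# Hedged sketch: embedding-based semantic similarity between a model-generated
# summary and a reference summary, contrasted with exact-match scoring.
# The concrete metric used by BinSum may differ; the model name below is an
# illustrative assumption, not the paper's configuration.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def exact_match(generated: str, reference: str) -> float:
    """Traditional exact-match scoring: 1.0 only when the strings are identical."""
    return float(generated.strip() == reference.strip())

def semantic_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings of the two summaries."""
    a, b = _model.encode([generated, reference])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

gen = "Copies a null-terminated string from src to dst and returns dst."
ref = "Duplicates the source string into the destination buffer."
print(exact_match(gen, ref))          # 0.0 -- surface forms differ
print(semantic_similarity(gen, ref))  # high score -- the meanings align
```

An embedding-based score of this kind rewards summaries that convey the same meaning in different words, which exact match cannot capture.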
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
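As a rough illustration of such a training-free repair loop (not the authors' implementation), the sketch below compiles a candidate snippet, feeds the compiler diagnostics back to an arbitrary model call, and asks for a corrected version; the llm_fn callable, the prompt wording, and the round limit are placeholders.

```python
# Hedged sketch of an iterative self-critique repair loop driven by compiler
# feedback. llm_fn stands in for any chat/completion call and is NOT the
# paper's implementation; the prompt and round limit are assumptions.
import os
import subprocess
import tempfile
from typing import Callable

def compile_feedback(code: str) -> str:
    """Syntax-check the snippet with gcc and return stderr (empty if it passes)."""
    with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["gcc", "-fsyntax-only", path],
                                capture_output=True, text=True)
        return result.stderr
    finally:
        os.unlink(path)

def self_critique_repair(code: str, llm_fn: Callable[[str], str],
                         max_rounds: int = 3) -> str:
    """Repeatedly ask the model to correct its code using compiler diagnostics."""
    for _ in range(max_rounds):
        feedback = compile_feedback(code)
        if not feedback:  # compiles cleanly; stop iterating
            break
        prompt = (f"The following C code fails to compile:\n{code}\n"
                  f"Compiler feedback:\n{feedback}\n"
                  "Please return a corrected version.")
        code = llm_fn(prompt)
    return code
```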
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
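A minimal sketch of this rewriting-based idea, assuming an arbitrary LLM rewrite call and a simple textual similarity measure: code that the model reproduces almost verbatim under rewriting is flagged as likely synthetic. The rewrite_fn callable, similarity measure, and threshold are illustrative assumptions, not the paper's method.

```python
# Hedged sketch of zero-shot detection via code rewriting: rewrite the
# candidate several times and score how closely the rewrites match the
# original. The intuition from the abstract is that LLM-generated code is
# rewritten more faithfully than human-written code. rewrite_fn and the
# threshold are placeholders, not the paper's setup.
import difflib
from typing import Callable

def rewriting_similarity(code: str,
                         rewrite_fn: Callable[[str], str],
                         n_rewrites: int = 4) -> float:
    """Average textual similarity between the code and its LLM rewrites."""
    scores = []
    for _ in range(n_rewrites):
        rewritten = rewrite_fn(code)  # e.g., prompt the model to "rewrite this code"
        scores.append(difflib.SequenceMatcher(None, code, rewritten).ratio())
    return sum(scores) / len(scores)

def looks_synthetic(code: str,
                    rewrite_fn: Callable[[str], str],
                    threshold: float = 0.8) -> bool:
    """Flag code as likely LLM-generated when rewrites stay close to the original."""
    return rewriting_similarity(code, rewrite_fn) >= threshold
```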
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- How Far Have We Gone in Binary Code Understanding Using Large Language Models [51.527805834378974]
We propose a benchmark to evaluate the effectiveness of Large Language Models (LLMs) in binary code understanding.
Our evaluations reveal that existing LLMs can understand binary code to a certain extent, thereby improving the efficiency of binary code analysis.
arXiv Detail & Related papers (2024-04-15T14:44:08Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages.
We conduct a systematic evaluation of over 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z)
- Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code [5.915447908295047]
We present a large-scale empirical study to investigate the ability of general LLMs and code LLMs for code translation.
Our study involves the translation of 1,700 code samples from three benchmarks and two real-world projects.
We find that correct translations range from 2.1% to 47.3% for the studied LLMs.
arXiv Detail & Related papers (2023-08-06T13:33:13Z)
- Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions [11.327913840111378]
We introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes.
We empirically evaluate the performance of several state-of-the-art LLMs for this task.
arXiv Detail & Related papers (2023-04-07T18:58:33Z)