Related papers: E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification

URL: http://arxiv.org/abs/2312.08477v1
Date: Wed, 13 Dec 2023 19:31:00 GMT
Title: E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification
Authors: Yu Hao, Weiteng Chen, Ziqiao Zhou, Weidong Cui
Abstract summary: Large Language Models (LLMs) offer new capabilities for software engineering tasks. LLMs simulate the execution of pseudo-code, effectively conducting static analysis encoded in the pseudo-code with minimal human effort. E&V includes a verification process for pseudo-code execution without needing an external oracle.
Score: 7.745665775992235
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Static analysis, the process of examining code without executing it, is crucial for identifying software issues. Yet, static analysis is hampered by its complexity and the need for customization for different targets. Traditional static analysis tools require extensive human effort and are often limited to specific target programs and programming languages. Recent advancements in Large Language Models (LLMs), such as GPT-4 and Llama, offer new capabilities for software engineering tasks. However, their application in static analysis, especially in understanding complex code structures, remains under-explored. This paper introduces a novel approach named E&V, which leverages LLMs to perform static analysis. Specifically, E&V employs LLMs to simulate the execution of pseudo-code, effectively conducting static analysis encoded in the pseudo-code with minimal human effort, thereby improving the accuracy of results. E&V includes a verification process for pseudo-code execution without needing an external oracle. This process allows E&V to mitigate hallucinations of LLMs and enhance the accuracy of static analysis results. We have implemented E&V in a prototype tool designed for triaging crashes through backward taint analysis. This prototype, paired with GPT-4-32k, has been applied to triage 170 recently fixed Linux kernel bugs across seven bug categories. Our experiments demonstrate that the prototype correctly identifies the blamed function in 81.2% of the cases. Additionally, we observe that our novel verification process significantly improves the accuracy, increasing it from 28.2% to 81.2%.

Related papers

The Hitchhiker's Guide to Program Analysis, Part II: Deep Thoughts by LLMs [17.497629884237647]
BugLens is a post-refinement framework that significantly improves static analysis precision. It raises precision from 0.10 (raw) and 0.50 (semi-automated refinement) to 0.72, substantially reducing false positives. Our results suggest that a structured LLM-based workflow can meaningfully enhance the effectiveness of static analysis tools.
arXiv Detail & Related papers (2025-04-16T02:17:06Z)
KNighter: Transforming Static Analysis with LLM-Synthesized Checkers [14.02595288424478]
KNighter generates high-precision checkers capable of detecting diverse bug patterns. To date, KNighter-synthesized checkers have discovered 92 new, critical, long-latent bugs in the Linux kernel.
arXiv Detail & Related papers (2025-03-12T02:30:19Z)
Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4's effectiveness in recommending and suggesting idiomatic actions. Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z)
LLMSA: A Compositional Neuro-Symbolic Approach to Compilation-free and Customizable Static Analysis [13.993290878789779]
We propose a compositional neuro-symbolic approach for compilation-free, customizable static analysis with reduced hallucinations. It attains 66.27% precision and 78.57% recall in taint vulnerability detection, surpassing an industrial approach in F1 score by 0.20.
arXiv Detail & Related papers (2024-12-18T23:14:59Z)
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks. We show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
Easing Maintenance of Academic Static Analyzers [0.0]
Mopsa is a static analysis platform that aims at being sound. This article documents the tools and techniques we have come up with to simplify the maintenance of Mopsa since 2017.
arXiv Detail & Related papers (2024-07-17T11:29:21Z)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
Customizing Static Analysis using Codesearch [1.7205106391379021]
A commonly used language to describe a range of static analysis applications is Datalog. We aim to make building custom static analysis tools much easier for developers, while at the same time providing a familiar framework for application security and static analysis experts. Our approach introduces a language called StarLang, a variant of Datalog which only includes programs with a fast runtime.
arXiv Detail & Related papers (2024-04-19T09:50:02Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
Leveraging Large Language Models for Automated Proof Synthesis in Rust [6.202137610101939]
Large Language Models (LLMs) have shown success in code analysis and synthesis. We present a combination of LLMs and static analysis to synthesize invariants, assertions, and other proof structures for a Rust-based formal verification framework called Verus. Our prototype decomposes the verification task into multiple smaller ones, iteratively queries GPT-4, and combines its output with lightweight static analysis.
arXiv Detail & Related papers (2023-11-07T05:47:47Z)
DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks [112.66827096358857]
We introduce DyVal, a protocol for dynamic evaluation of large language models (LLMs). Based on our framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4.
arXiv Detail & Related papers (2023-09-29T12:04:14Z)
The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models [18.026567399243]
Large Language Models (LLMs) offer a promising alternative to static analysis. In this paper, we take a deep dive into the open space of LLM-assisted static analysis. We develop LLift, a fully automated framework that interfaces with both a static analysis tool and an LLM.
arXiv Detail & Related papers (2023-08-01T02:57:43Z)
A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems. Static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models. We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
Malware Classification Using Static Disassembly and Machine Learning [1.5469452301122177]
We propose four easy-to-extract and small-scale features, including sizes and permissions of Windows PE sections, content, and import libraries, to classify malware families. Compared with detailed behavior-related features like API sequences, the proposed features provide macroscopic information about malware. We show that the proposed features, together with a classical machine learning algorithm (Random Forest), achieve an accuracy of 99.40%.
arXiv Detail & Related papers (2021-12-10T18:14:47Z)
D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools. We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.