When Dataflow Analysis Meets Large Language Models
- URL: http://arxiv.org/abs/2402.10754v1
- Date: Fri, 16 Feb 2024 15:21:35 GMT
- Title: When Dataflow Analysis Meets Large Language Models
- Authors: Chengpeng Wang, Wuqi Zhang, Zian Su, Xiangzhe Xu, Xiaoheng Xie,
Xiangyu Zhang
- Abstract summary: This paper introduces LLMDFA, an LLM-powered dataflow analysis framework that analyzes arbitrary code snippets without requiring a compilation infrastructure.
Inspired by summary-based dataflow analysis, LLMDFA decomposes the problem into three sub-problems, which are effectively resolved by several essential strategies.
Our evaluation shows that this design mitigates hallucinations and improves reasoning ability, achieving high precision and recall in detecting dataflow-related bugs.
- Score: 9.458251511218817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataflow analysis is a powerful code analysis technique that reasons
about dependencies between program values, offering support for code optimization,
program comprehension, and bug detection. Existing approaches require the
successful compilation of the subject program and customizations for downstream
applications. This paper introduces LLMDFA, an LLM-powered dataflow analysis
framework that analyzes arbitrary code snippets without requiring a compilation
infrastructure and automatically synthesizes downstream applications. Inspired
by summary-based dataflow analysis, LLMDFA decomposes the problem into three
sub-problems, which are effectively resolved by several essential strategies,
including few-shot chain-of-thought prompting and tool synthesis. Our
evaluation shows that the design mitigates hallucinations and improves
reasoning ability, achieving high precision and recall in detecting
dataflow-related bugs on benchmark programs and outperforming state-of-the-art
(classic) tools, including a very recent industrial analyzer.
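To make the decomposition concrete, here is a minimal Python sketch of how a summary-based, LLM-driven analysis could be organized into the three sub-problems the abstract mentions (source/sink extraction, intra-procedural summarization, inter-procedural path assembly). The `ask_llm` helper, the prompts, and the `Fact` type are illustrative assumptions, not LLMDFA's actual code.

```python
# A minimal, hypothetical sketch of a summary-based, LLM-driven dataflow
# analysis, split into three sub-problems as the abstract describes.
# `ask_llm`, the prompts, and the Fact type are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    function: str   # enclosing function name
    variable: str   # program value of interest
    line: int       # source line number


def ask_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call primed with few-shot CoT examples."""
    raise NotImplementedError("plug in an LLM client here")


def extract_sources_and_sinks(snippet: str) -> tuple[list[Fact], list[Fact]]:
    """Sub-problem 1: identify taint sources and sinks inside one function."""
    _reply = ask_llm(f"List the taint sources and sensitive sinks in:\n{snippet}")
    return [], []   # a real implementation would parse Facts out of _reply


def summarize_flows(snippet: str, sources: list[Fact],
                    sinks: list[Fact]) -> list[tuple[Fact, Fact]]:
    """Sub-problem 2: intra-procedural source-to-sink flows (function summary)."""
    _reply = ask_llm("Think step by step: which listed sources reach which sinks?\n"
                     f"{snippet}\nSources: {sources}\nSinks: {sinks}")
    return []       # parse (source, sink) pairs out of _reply


def report_bugs(summaries: dict[str, list[tuple[Fact, Fact]]],
                call_edges: list[tuple[str, str]]) -> list[list[Fact]]:
    """Sub-problem 3: stitch per-function summaries along call edges and keep
    only paths an external checker (e.g., a synthesized script) deems feasible."""
    return []
```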
Related papers
- Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated compared with canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
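The self-critique loop described above can be approximated, in spirit, by a small repair loop that feeds compiler diagnostics back to the model; the sketch below uses Python's `py_compile` as a stand-in compiler, and the `ask_llm` helper and prompt are illustrative assumptions rather than the paper's actual method.

```python
# Hypothetical sketch of a training-free critique-and-repair loop that feeds
# compiler feedback back to the model; helper names and the prompt are
# illustrative, not the paper's actual method.
import pathlib
import subprocess
import tempfile


def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")


def compile_feedback(code: str) -> str:
    """Syntax-check the snippet with py_compile and return any diagnostics."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", "-m", "py_compile", path],
                            capture_output=True, text=True)
    pathlib.Path(path).unlink(missing_ok=True)
    return result.stderr   # empty string means the snippet compiled


def repair(code: str, max_rounds: int = 3) -> str:
    """Ask the model to critique and fix its code until it compiles (or give up)."""
    for _ in range(max_rounds):
        errors = compile_feedback(code)
        if not errors:
            break
        code = ask_llm("Critique and fix this code.\n"
                       f"Compiler feedback:\n{errors}\nCode:\n{code}")
    return code
```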
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the resource demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
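As background on the kind of technique such a toolkit benchmarks, here is a textbook symmetric per-tensor int8 round-to-nearest quantizer; it is a generic illustration, not one of LLMC's algorithms or data formats.

```python
# Textbook symmetric per-tensor int8 round-to-nearest quantization; a generic
# illustration of the kind of scheme benchmarked, not an LLMC algorithm.
import numpy as np


def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.abs(weights).max()) / 127.0
    scale = scale if scale > 0 else 1.0          # avoid division by zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale


w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs quantization error:", np.abs(w - dequantize(q, s)).max())
```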
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- A Thorough Examination of Decoding Methods in the Era of LLMs [72.65956436513241]
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers.
This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of large language models.
Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
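As a reminder of what such decoding methods look like in practice, here is a minimal nucleus (top-p) sampling sketch over a toy next-token distribution; this is the standard formulation, not code from the paper.

```python
# Minimal nucleus (top-p) sampling over a toy next-token distribution.
# Standard textbook formulation; not code from the surveyed paper.
import numpy as np


def nucleus_sample(probs: np.ndarray, p: float = 0.9, rng=None) -> int:
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                   # most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest nucleus covering p
    nucleus = order[:cutoff]
    renormalized = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=renormalized))


token_probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])   # toy vocabulary of 5 tokens
print(nucleus_sample(token_probs, p=0.9))
```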
arXiv Detail & Related papers (2024-02-10T11:14:53Z)
- The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models [18.026567399243]
Large Language Models (LLMs) offer a promising alternative to static analysis.
In this paper, we take a deep dive into the open space of LLM-assisted static analysis.
We develop LLift, a fully automated framework that interfaces with both a static analysis tool and an LLM.
arXiv Detail & Related papers (2023-08-01T02:57:43Z)
- FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs [92.47146416628965]
FuzzyFlow is a fault localization and test case extraction framework designed to test program optimizations.
We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations.
To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation.
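Test-input minimization of this kind is often built on a greedy, delta-debugging-style loop; the sketch below is a generic illustration under that assumption, not FuzzyFlow's actual memory-versus-recomputation algorithm.

```python
# Generic greedy test-input minimization (delta-debugging flavor); illustrative
# only, not FuzzyFlow's memory-versus-recomputation algorithm.
from typing import Callable, Sequence


def minimize(inputs: Sequence, still_fails: Callable[[list], bool]) -> list:
    """Greedily drop chunks of the input while the bug still reproduces."""
    current = list(inputs)
    chunk = max(1, len(current) // 2)
    while chunk >= 1:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if candidate and still_fails(candidate):
                current = candidate            # keep the smaller failing input
            else:
                i += chunk                     # this chunk is needed; move on
        chunk //= 2
    return current


# Toy usage: find a minimal subset that still trips a "failure" predicate.
print(minimize([3, 1, 4, 1, 5], lambda xs: sum(xs) > 6))
```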
arXiv Detail & Related papers (2023-06-28T13:00:17Z)
- Computing unsatisfiable cores for LTLf specifications [3.251765107970636]
Linear-time temporal logic on finite traces (LTLf) is rapidly becoming a de facto standard for writing specifications in many application domains.
We provide four algorithms for extracting an unsatisfiable core using state-of-the-art approaches to satisfiability checking.
The results show the feasibility, effectiveness, and complementarities of the different algorithms and tools.
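For orientation, deletion-based core extraction is the textbook scheme behind this kind of algorithm; the sketch below is generic and takes an abstract `is_sat` callback, so it is not one of the paper's four LTLf-specific procedures.

```python
# Deletion-based unsatisfiable-core extraction, the textbook scheme; a generic
# illustration, not one of the paper's four LTLf-specific algorithms.
from typing import Callable, Sequence


def unsat_core(conjuncts: Sequence, is_sat: Callable[[list], bool]) -> list:
    """Drop one conjunct at a time; keep it only if its removal makes the set SAT."""
    core = list(conjuncts)
    i = 0
    while i < len(core):
        trial = core[:i] + core[i + 1:]
        if not is_sat(trial):   # still unsatisfiable without conjunct i,
            core = trial        # so conjunct i is not needed in the core
        else:
            i += 1              # conjunct i is necessary; keep it
    return core


# Usage: pass the specification's conjuncts and a satisfiability callback,
# e.g. one that invokes an LTLf satisfiability checker on the conjunction.
```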
arXiv Detail & Related papers (2022-03-09T16:08:43Z)
- Indexing Context-Sensitive Reachability [16.114012813668932]
Flare is a reduction from the CFL reachability problem to the conventional graph reachability problem for context-sensitive data flow analysis.
We have applied our reduction to a context-sensitive alias analysis and a context-sensitive information-flow analysis for C/C++ programs.
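The payoff of reducing to plain graph reachability is that standard reachability indexing applies; the sketch below precomputes a transitive-closure index with per-node BFS to illustrate the target problem, and is not Flare's actual CFL-to-graph-reachability construction.

```python
# A generic reachability index: precompute the transitive closure once, then
# answer queries in O(1). Illustrates the target problem of the reduction; it
# is not Flare's CFL-to-graph-reachability construction.
from collections import defaultdict, deque


def build_index(edges):
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    nodes = set(succ) | {v for vs in succ.values() for v in vs}
    reach = {}
    for s in nodes:                      # one BFS per node (fine for a sketch)
        seen, frontier = {s}, deque([s])
        while frontier:
            u = frontier.popleft()
            for v in succ[u]:
                if v not in seen:
                    seen.add(v)
                    frontier.append(v)
        reach[s] = seen
    return reach


index = build_index([("a", "b"), ("b", "c"), ("d", "c")])
print("a reaches c:", "c" in index["a"])   # True
print("c reaches a:", "a" in index["c"])   # False
```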
arXiv Detail & Related papers (2021-09-03T05:41:30Z)
- Comparative Code Structure Analysis using Deep Learning for Performance Prediction [18.226950022938954]
This paper aims to assess the feasibility of using purely static information (e.g., abstract syntax tree or AST) of applications to predict performance change based on the change in code structure.
Our evaluations of several deep embedding learning methods demonstrate that tree-based Long Short-Term Memory (LSTM) models can leverage the hierarchical structure of source code to discover latent representations, achieving up to 84% accuracy (individual problem) and 73% accuracy (combined dataset with multiple problems) in predicting the change in performance.
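To illustrate what "purely static information" from an AST can look like, the sketch below extracts a few structural features (node count, depth, loop count) with Python's `ast` module; it is a toy example of static inputs a learned model could consume, not the paper's tree-based LSTM pipeline.

```python
# Extracting simple structural features from a Python AST as an example of
# purely static inputs a learned performance model could consume; this is a
# toy illustration, not the paper's tree-based LSTM.
import ast


def ast_features(source: str) -> dict:
    tree = ast.parse(source)

    def depth(node):
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(c) for c in children), default=0)

    return {
        "nodes": sum(1 for _ in ast.walk(tree)),
        "depth": depth(tree),
        "loops": sum(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree)),
    }


print(ast_features("for i in range(10):\n    total = i * i\n"))
```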
arXiv Detail & Related papers (2021-02-12T16:59:12Z)