Can LLMs Formally Reason as Abstract Interpreters for Program Analysis?
- URL: http://arxiv.org/abs/2503.12686v1
- Date: Sun, 16 Mar 2025 23:05:52 GMT
- Title: Can LLMs Formally Reason as Abstract Interpreters for Program Analysis?
- Authors: Jacqueline L. Mitchell, Brian Hyeongseok Kim, Chenyu Zhou, Chao Wang
- Abstract summary: LLMs have impressive capabilities in code generation and comprehension, but their potential in program analysis remains under-explored. We systematically investigate whether LLMs can reason about programs using a program analysis framework called abstract interpretation. We validate our approach using state-of-the-art LLMs on 22 challenging benchmark programs from the Software Verification Competition (SV-COMP) 2019 dataset. Our results show that our strategies are able to elicit abstract interpretation-based reasoning in the tested models, but LLMs are susceptible to logical errors.
- Score: 2.520295252080748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs have demonstrated impressive capabilities in code generation and comprehension, but their potential to perform program analysis in a formal, automatic manner remains under-explored. To that end, we systematically investigate whether LLMs can reason about programs using a program analysis framework called abstract interpretation. We prompt LLMs to follow two different strategies, denoted as Compositional and Fixed Point Equation, to formally reason in the style of abstract interpretation, which, to the best of our knowledge, has not been done before. We validate our approach using state-of-the-art LLMs on 22 challenging benchmark programs from the Software Verification Competition (SV-COMP) 2019 dataset, which is widely used in program analysis. Our results show that our strategies are able to elicit abstract interpretation-based reasoning in the tested models, but LLMs are susceptible to logical errors, especially when interpreting complex program structures, as well as to general hallucinations. This highlights key areas for improvement in the formal reasoning capabilities of LLMs.
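The paper prompts LLMs to emulate how an abstract interpreter derives loop invariants by solving fixed point equations over an abstract domain. To make that terminology concrete, the sketch below shows in plain Python what such a computation looks like on the interval domain for a simple counting loop. This is an illustrative sketch only: the `Interval` class, the widening rule, and the example loop are assumptions made here for exposition, not the paper's prompting pipeline or benchmarks.

```python
# Minimal sketch of abstract interpretation over the interval domain.
# Illustrative only: the Interval class, widening rule, and example loop
# are assumptions for exposition, not the paper's prompting strategies.
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float  # lower bound, may be -inf
    hi: float  # upper bound, may be +inf

    def join(self, other: "Interval") -> "Interval":
        # Least upper bound in the interval lattice.
        return Interval(min(self.lo, other.lo), max(self.hi, other.hi))

    def widen(self, other: "Interval") -> "Interval":
        # Standard widening: any bound that grew jumps to +/- infinity,
        # which forces the fixed-point iteration to terminate.
        lo = self.lo if other.lo >= self.lo else -math.inf
        hi = self.hi if other.hi <= self.hi else math.inf
        return Interval(lo, hi)


def loop_head_invariant() -> Interval:
    """Over-approximate i at the head of:  i = 0; while cond: i = i + 1"""
    init = Interval(0, 0)                      # abstract state on loop entry
    i = init
    while True:
        body = Interval(i.lo + 1, i.hi + 1)    # transfer function of i := i + 1
        # Fixed point equation I = init ⊔ F(I), solved by widening iteration.
        new_i = i.widen(init.join(body))
        if new_i == i:                         # post-fixpoint reached: invariant found
            return i
        i = new_i


if __name__ == "__main__":
    print(loop_head_invariant())               # Interval(lo=0, hi=inf): i ∈ [0, +inf)
```

The widening step is what guarantees the iteration terminates even when the loop bound is unknown; carrying out such an iteration by hand requires applying transfer functions, joins, and widening correctly at every step, which is the kind of formal bookkeeping the paper asks LLMs to perform.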
Related papers
- Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference [0.9319432628663639]
Large Language Models (LLMs) are increasingly being used to automate programming tasks. This paper introduces FormalBench, a benchmark designed to evaluate LLMs' reasoning abilities on program semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications.
arXiv Detail & Related papers (2025-02-22T13:27:31Z)
- Can Large Language Models Understand Symbolic Graphics Programs? [136.5639211254501]
Symbolic graphics programs are popular in computer graphics. We create a benchmark for the semantic visual understanding of symbolic graphics programs. We find that LLMs considered stronger at reasoning generally perform better.
arXiv Detail & Related papers (2024-08-15T17:59:57Z)
- Large Language Models are Interpretable Learners [53.56735770834617]
In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge the gap between expressiveness and interpretability.
The pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts.
As the knowledge learned by the proposed LLM-based Symbolic Programs (LSPs) is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable) and to other LLMs.
arXiv Detail & Related papers (2024-06-25T02:18:15Z)
- Can LLM Graph Reasoning Generalize beyond Pattern Memorization? [46.93972334344908]
We evaluate whether large language models (LLMs) can go beyond the semantic, numeric, and structural reasoning patterns in their synthetic training data and improve utility on real-world graph-based tasks.
We find that while post-training alignment is most promising for real-world tasks, empowering LLM graph reasoning to go beyond pattern memorization remains an open research question.
arXiv Detail & Related papers (2024-06-23T02:59:15Z)
- A Survey Study on the State of the Art of Programming Exercise Generation using Large Language Models [0.0]
This paper analyzes Large Language Models (LLMs) with regard to their programming exercise generation capabilities.
Through a survey study, we defined the state of the art, extracted their strengths and weaknesses and proposed an evaluation matrix.
arXiv Detail & Related papers (2024-05-30T15:49:34Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks.
The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human.
These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z)
- LLMs for Relational Reasoning: How Far are We? [8.840750655261251]
Large language models (LLMs) have revolutionized many areas by achieving state-of-the-art performance on downstream tasks.
Recent efforts have demonstrated that LLMs are poor at solving sequential decision-making problems.
arXiv Detail & Related papers (2024-01-17T08:22:52Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tuning data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models [56.34029644009297]
Large language models (LLMs) have demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems.
LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning.
We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning techniques to assess the performance of the models.
arXiv Detail & Related papers (2023-10-02T01:00:50Z)
- LMs: Understanding Code Syntax and Semantics for Code Analysis [25.508254718438636]
We evaluate the capabilities of large language models (LLMs) and their limitations for code analysis in software engineering.
We employ four state-of-the-art foundation models: GPT4, GPT3.5, StarCoder, and CodeLlama-13b-instruct.
arXiv Detail & Related papers (2023-05-20T08:43:49Z)