Related papers: Can Large Language Models Simulate Symbolic Execution Output Like KLEE?

Can Large Language Models Simulate Symbolic Execution Output Like KLEE?

URL: http://arxiv.org/abs/2511.08530v1
Date: Wed, 12 Nov 2025 02:03:24 GMT
Title: Can Large Language Models Simulate Symbolic Execution Output Like KLEE?
Authors: Rong Feng, Vanisha Gupta, Vivek Patel, Viroopaksh Reddy Ernampati, Suman Saha,
Abstract summary: Symbolic execution helps check programs by exploring different paths based on symbolic inputs.<n>One of KLEE's biggest issues is how slow it can get when programs have lots of branching paths.<n>We tested whether GPT-4o could predict the KLEE outputs and the most complex path using a dataset of 100 C programs.
Score: 1.167405291587978
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Symbolic execution helps check programs by exploring different paths based on symbolic inputs. Tools like KLEE are commonly used because they can automatically detect bugs and create test cases. But one of KLEE's biggest issues is how slow it can get when programs have lots of branching paths-it often becomes too resource-heavy to run on large or complex code. In this project, we wanted to see if a large language model like GPT-4o could simulate the kinds of outputs that KLEE generates. The idea was to explore whether LLMs could one day replace parts of symbolic execution to save time and resources. One specific goal was to have GPT-4o identify the most constrained path in a program, this is the execution path with the most symbolic conditions. These paths are especially important because they often represent edge cases that are harder to test and more likely to contain deep bugs. However, figuring this out usually requires fully running KLEE, which can be expensive. So, we tested whether GPT-4o could predict the KLEE outputs and the most complex path using a dataset of 100 C programs. Our results showed about 20% accuracy in generating KLEE-like outputs and identifying the most constrained path. While not highly accurate, this early work helps show what current LLMs can and can't do when it comes to simulating symbolic execution.

Related papers

Can Large Language Models Solve Path Constraints in Symbolic Execution? [32.45630887872447]
We focus on investigating the possibility of adopting large language models (LLM) for path constraint solving.<n>We build new evaluation pipelines and benchmarks for two tasks: test case generation and path classification.<n>Our experiment results show that state-of-the-art LLMs are able to solve path constraints in both generation and classification tasks.
arXiv Detail & Related papers (2025-11-23T04:54:48Z)
Gistify! Codebase-Level Understanding via Runtime Execution [97.63528391679807]
We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality.<n>We show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long executions.
arXiv Detail & Related papers (2025-10-30T17:58:26Z)
Generating and Understanding Tests via Path-Aware Symbolic Execution with LLMs [8.828992823055]
PALM is a test generation system that combines symbolic path enumeration with LLM-assisted test generation.<n>Palm helps users better understand path coverage and identify which paths are actually exercised by PALM-generated tests.
arXiv Detail & Related papers (2025-06-24T03:46:16Z)
Empc: Effective Path Prioritization for Symbolic Execution with Path Cover [4.247960711260534]
Symbolic execution can formally reason the correctness of program behaviors and detect software bugs.<n>It suffers from an inherent limitation: path explosion.<n>We propose a novel and effective path prioritization technique with path cover, named Empc.
arXiv Detail & Related papers (2025-05-06T14:08:36Z)
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions [4.852619858744873]
Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis.<n>We introduce ThrowBench, a benchmark consisting of over 2,400 short user-written programs written in four different programming languages.<n>We evaluate our benchmark on six state-of-the-art code LLMs and see modest performance ranging from 19 to 38% (F1 score)
arXiv Detail & Related papers (2025-03-06T09:22:23Z)
Can LLM Generate Regression Tests for Software Commits? [15.653758694625854]
Large Language Models (LLMs) have shown tremendous promise in automated software engineering.<n>We implement Cleverest, a feedback-directed zero-shot LLM-based regression test generation technique.<n>For programs using more human-readable file formats, like XML or JavaScript, Cleverest performed very well.
arXiv Detail & Related papers (2025-01-19T15:46:26Z)
Python Symbolic Execution with LLM-powered Code Generation [40.906079949304726]
Symbolic execution is a key technology in software testing, which generates test cases by collecting symbolic path constraints. Symbolic execution has been proven helpful in generating high-coverage test cases, but its limitations prevent it from broader usage in software testing. We propose an agent, LLM-Sym, that automatically calls an SMT solver, Z3, to solve execution path constraints.
arXiv Detail & Related papers (2024-09-14T02:43:20Z)
NExT: Teaching Large Language Models to Reason about Code Execution [50.93581376646064]
Large language models (LLMs) of code are typically trained on the surface textual form of programs. We propose NExT, a method to teach LLMs to inspect the execution traces of programs and reason about their run-time behavior.
arXiv Detail & Related papers (2024-04-23T01:46:32Z)
Chain of Code: Reasoning with a Language Model-Augmented Code Emulator [115.16975276693267]
We propose Chain of Code, a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch.
arXiv Detail & Related papers (2023-12-07T17:51:43Z)
Guess & Sketch: Language Model Guided Transpilation [59.02147255276078]
Learned transpilation offers an alternative to manual re-writing and engineering efforts. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence.
arXiv Detail & Related papers (2023-09-25T15:42:18Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems. PaL offloads the solution step to a programmatic runtime such as a Python interpreter. We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it. Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.