PLSemanticsBench: Large Language Models As Programming Language Interpreters
- URL: http://arxiv.org/abs/2510.03415v2
- Date: Tue, 07 Oct 2025 03:28:52 GMT
- Title: PLSemanticsBench: Large Language Models As Programming Language Interpreters
- Authors: Aditya Thimmaiah, Jiyang Zhang, Jayanth Srinivasa, Junyi Jessy Li, Milos Gligoric,
- Abstract summary: As large language models (LLMs) excel at code reasoning, a natural question arises: can an LLM execute programs (i.e., act as an interpreter) purely based on a programming language's formal semantics?<n>We study this question using the imperative language IMP, formalized via small-step operational semantics (SOS) and rewriting-based operational semantics (K-semantics)<n>We introduce three evaluation sets-Human-Written, LLM-Translated, and Fuzzer- Generated-whose difficulty is controlled by code-complexity metrics spanning the size, control-flow, and data-flow axes.
- Score: 31.611330217819713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) excel at code reasoning, a natural question arises: can an LLM execute programs (i.e., act as an interpreter) purely based on a programming language's formal semantics? If so, it will enable rapid prototyping of new programming languages and language features. We study this question using the imperative language IMP (a subset of C), formalized via small-step operational semantics (SOS) and rewriting-based operational semantics (K-semantics). We introduce three evaluation sets-Human-Written, LLM-Translated, and Fuzzer- Generated-whose difficulty is controlled by code-complexity metrics spanning the size, control-flow, and data-flow axes. Given a program and its semantics formalized with SOS/K-semantics, models are evaluated on three tasks ranging from coarse to fine: (1) final-state prediction, (2) semantic rule prediction, and (3) execution trace prediction. To distinguish pretraining memorization from semantic competence, we define two nonstandard semantics obtained through systematic mutations of the standard rules. Across strong code/reasoning LLMs, performance drops under nonstandard semantics despite high performance under the standard one. We further find that (i) there are patterns to different model failures, (ii) most reasoning models perform exceptionally well on coarse grained tasks involving reasoning about highly complex programs often containing nested loop depths beyond five, and surprisingly, (iii) providing formal semantics helps on simple programs but often hurts on more complex ones. Overall, the results show a promise that LLMs could serve as programming language interpreters, but points to the lack of their robust semantics understanding. We release the benchmark and the supporting code at https://github.com/EngineeringSoftware/PLSemanticsBench.
Related papers
- On Code-Induced Reasoning in LLMs [21.875805779552564]
We construct parallel instruction datasets in ten programming languages.<n>We apply controlled perturbations that selectively disrupt structural or semantic properties of code.<n>Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones.
arXiv Detail & Related papers (2025-09-25T19:57:36Z) - Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [68.00108157244952]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable.<n>A promising yet largely uncharted alternative is formal language-based reasoning.<n>Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes.
arXiv Detail & Related papers (2025-07-22T08:13:01Z) - Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation.<n>For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code.<n>Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z) - Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference [0.9319432628663639]
Large Language Models (LLMs) are increasingly being used to automate programming tasks.<n>This paper introduces FormalBench, a benchmark designed to evaluate LLMs' reasoning abilities on program semantics.<n>Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications.
arXiv Detail & Related papers (2025-02-22T13:27:31Z) - EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs)<n>This task directly tests a model's ability to reason about program semantics.<n>We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z) - Large Language Models are Interpretable Learners [53.56735770834617]
In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge the gap between expressiveness and interpretability.
The pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts.
As the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable) and other LLMs.
arXiv Detail & Related papers (2024-06-25T02:18:15Z) - Synthetic Programming Elicitation for Text-to-Code in Very Low-Resource Programming and Formal Languages [21.18996339478024]
We introduce emphsynthetic programming elicitation and compilation (SPEAC)
SPEAC produces syntactically correct programs more frequently and without sacrificing semantic correctness.
We empirically evaluate the performance of SPEAC in a case study for the UCLID5 formal verification language.
arXiv Detail & Related papers (2024-06-05T22:16:19Z) - CodeMind: Evaluating Large Language Models for Code Reasoning [6.819757372634151]
Large Language Models (LLMs) have been widely used to automate programming tasks.<n>This paper introduces CodeMind, a framework designed to gauge the code reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-02-15T02:24:46Z) - Coupling Large Language Models with Logic Programming for Robust and
General Reasoning from Text [5.532477732693001]
We show that a large language model can serve as a highly effective few-shot semantically.
It can convert natural language sentences into a logical form that serves as input for answer set programs.
We demonstrate that this method achieves state-of-the-art performance on several benchmarks, including bAbI, StepGame, CLUTRR, and gSCAN.
arXiv Detail & Related papers (2023-07-15T03:29:59Z) - Language Models of Code are Few-Shot Commonsense Learners [106.1531522893209]
Given a natural language input, the goal is to generate a graph such as an event -- or a reasoning-graph.
Existing approaches serialize the output graph as a flat list of nodes and edges.
We show that when we instead frame structured commonsense reasoning tasks as code generation tasks, pre-trained LMs of code are better structured commonsense reasoners than LMs of natural language.
arXiv Detail & Related papers (2022-10-13T16:09:36Z) - Synchromesh: Reliable code generation from pre-trained language models [38.15391794443022]
We propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation.
First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection.
Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD), a general framework for constraining the output to a set of valid programs in the target language.
arXiv Detail & Related papers (2022-01-26T22:57:44Z) - Few-Shot Semantic Parsing with Language Models Trained On Code [52.23355024995237]
We find that Codex performs better at semantic parsing than equivalent GPT-3 models.
We find that unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps as meaning representations used in semantic parsing are structured similar to code.
arXiv Detail & Related papers (2021-12-16T08:34:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.