Related papers: Measuring LLM Code Generation Stability via Structural Entropy

URL: http://arxiv.org/abs/2508.14288v1
Date: Tue, 19 Aug 2025 22:07:12 GMT
Title: Measuring LLM Code Generation Stability via Structural Entropy
Authors: Yewei Song, Tiezhu Sun, Xunzhu Tang, Prateek Rajput, Tegawende F. Bissyande, Jacques Klein,
Abstract summary: We extend "structural-entropy concepts" to the program domain by pairing entropy with abstract syntax tree (AST) analysis. We measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural overlap, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent.
Score: 4.812266013066678
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior "structural-entropy concepts" to the program domain by pairing entropy with abstract syntax tree (AST) analysis. For any fixed prompt, we collect the multiset of depth-bounded subtrees of AST in each generated program and treat their relative frequencies as a probability distribution. We then measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural overlap, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Both metrics admit structural-only and token-aware variants, enabling separate views on control-flow shape and identifier-level variability. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent. We benchmark several leading LLMs on standard code generation tasks, demonstrating that AST-driven structural entropy reveals nuances in model consistency and robustness. The method runs in O(n,d) time with no external tests, providing a lightweight addition to the code-generation evaluation toolkit.
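The pipeline described in the abstract (depth-bounded AST subtrees, relative frequencies, Jensen-Shannon divergence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the depth-2 default, the string serialization of subtrees, and the use of Python's built-in `ast` module are all assumptions made for illustration, and only the structural-only view (node types, no identifiers) is covered. The Structural Cross-Entropy ratio is left out because the abstract does not give its formula.

```python
import ast
import math
from collections import Counter


def subtree_signature(node, depth):
    """Serialize node types down to `depth` levels (identifiers are ignored)."""
    name = type(node).__name__
    if depth == 0:
        return name
    children = [subtree_signature(c, depth - 1) for c in ast.iter_child_nodes(node)]
    return f"{name}({','.join(children)})" if children else name


def subtree_distribution(source, depth=2):
    """Relative frequencies of depth-bounded subtrees rooted at every AST node."""
    tree = ast.parse(source)
    counts = Counter(subtree_signature(n, depth) for n in ast.walk(tree))
    total = sum(counts.values())
    return {sig: c / total for sig, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded by 1."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # Every key of `a` has positive mass, and m[k] >= a[k] / 2 > 0.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


if __name__ == "__main__":
    sample_a = "def f(a):\n    return a + 1\n"
    sample_b = "def g(b):\n    return b + 2\n"   # same shape, different names
    sample_c = "for i in range(3):\n    print(i)\n"

    dist_a, dist_b, dist_c = (subtree_distribution(s) for s in (sample_a, sample_b, sample_c))
    print(js_divergence(dist_a, dist_b))  # structurally identical, so this is 0.0
    print(js_divergence(dist_a, dist_c))  # differing structure gives a positive value
```

Because the signatures keep only node types, two generations that differ just in identifiers or constants score as perfectly stable. A token-aware variant would presumably fold names or literal values into the signature, at the cost of tying the sketch more closely to one language's AST.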

Related papers

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z)
Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning. Most test-time scaling (TTS) algorithms rely on autoregressive decoding. We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z)
Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals [13.89434979851652]
Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. We present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction.
arXiv Detail & Related papers (2026-02-01T02:35:59Z)
Task-Awareness Improves LLM Generations and Uncertainty [48.857040212979484]
Bayes-optimal responses consistently outperform standard decoding methods like beam search. Our decision-theoretic framework is applicable to any problem that admits a latent response structure.
arXiv Detail & Related papers (2026-01-29T10:16:23Z)
UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting [90.47915032778366]
We propose UniDiff, a unified diffusion framework for multimodal time series forecasting. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism integrates structural information from timestamps and semantic context from texts. Experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-08T05:36:14Z)
STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability [11.095198847819573]
Large Language Models (LLMs) are increasingly deployed for structured data generation. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs.
arXiv Detail & Related papers (2025-11-27T02:49:52Z)
Dynamic Stability of LLM-Generated Code [6.120340803716395]
Current evaluations of LLMs for code generation overlook the fact that functionally correct solutions can differ significantly in algorithmic complexity. We introduce a principled framework for evaluating the dynamic stability of generated code. Our findings call for stability-aware objectives in code generation and new benchmarks with test cases for robust, real-world evaluation.
arXiv Detail & Related papers (2025-11-07T09:58:06Z)
Learning Discrete Bayesian Networks with Hierarchical Dirichlet Shrinkage [52.914168158222765]
We detail a comprehensive Bayesian framework for learning DBNs. We give a novel Markov chain Monte Carlo (MCMC) algorithm utilizing parallel Langevin proposals to generate exact posterior samples. We apply our methodology to uncover prognostic network structure from primary breast cancer samples.
arXiv Detail & Related papers (2025-09-16T17:24:35Z)
SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs [3.036179638516407]
We introduce SEED, a structural encoder for embedding-driven decoding, which integrates four stages: a token-aware encoder for patch extraction, a projection module that aligns patches with language model embeddings, and a semantic reprogramming mechanism that maps patches to task-aware prototypes. This modular architecture decouples representation learning from inference, enabling efficient alignment between numerical patterns and semantic reasoning.
arXiv Detail & Related papers (2025-06-25T06:40:14Z)
ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities [14.13459302125202]
Evaluating consistency in large language models (LLMs) is crucial for ensuring reliability. Traditional self-consistency methods often miss subtle semantic changes in natural language and functional shifts in code or equations. We propose ConsistencyChecker, a tree-based evaluation framework designed to measure consistency through sequences of reversible transformations.
arXiv Detail & Related papers (2025-06-14T07:18:33Z)
Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models [5.6672926445919165]
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods often lack a probabilistic foundation. We propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations.
arXiv Detail & Related papers (2025-06-11T13:02:17Z)
Enhancing LLMs for Time Series Forecasting via Structure-Guided Cross-Modal Alignment [12.319685395140862]
We propose a framework that exploits and aligns the state-transition graph structures shared by time-series and linguistic data as sequential modalities. Experiments on multiple benchmarks demonstrate that SGCMA achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-05-19T14:30:41Z)
Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC). LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses. LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space. We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
GFlowNet-EM for learning compositional latent variable models [115.96660869630227]
A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization. We propose the use of GFlowNets, algorithms for sampling from an unnormalized density. By training GFlowNets to sample from the posterior over latents, we take advantage of their strengths as amortized variational algorithms.
arXiv Detail & Related papers (2023-02-13T18:24:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.