SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code
- URL: http://arxiv.org/abs/2509.24507v1
- Date: Mon, 29 Sep 2025 09:21:32 GMT
- Title: SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code
- Authors: Qinglin Wang, Zhihong Sun, Ruyun Wang, Tao Huang, Zhi Jin, Ge Li, Chen Lyu
- Abstract summary: Post-hoc repair pipelines detect such faults only after execution. We present SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision.
- Score: 46.20378145112059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) can translate natural language requirements into code, yet empirical analyses of representative models reveal that semantic errors (programs that compile but behave incorrectly) constitute the majority of observed faults (e.g., >60% on DeepSeek-Coder-6.7B and QwenCoder-7B). Post-hoc repair pipelines detect such faults only after execution, incurring latency, relying on incomplete test suites, and often mis-localizing the defect. Since semantic drift originates in the autoregressive decoding process, intervening while the code is being generated is a direct way to stop error propagation. Constrained-decoding approaches such as ROCODE attempt this, but still wait until the entire program runs to obtain feedback and use entropy heuristics that do not truly capture semantics. A more effective solution must inject semantic signals, early and precisely, into the decoding process. We present SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision. To train the evaluator, we build SemDiff, the first dataset with fine-grained annotations that mark the exact line where a correct and an incorrect implementation diverge. The evaluator, once embedded in the LLM's decoder, flags deviations on partial code, rolls back to the faulty line, and guides regeneration, without executing the program or requiring test cases. Across four benchmarks, SemGuard consistently outperforms state-of-the-art baselines. It lowers the semantic error rate by 19.86% on SemDiff relative to ROCODE, and lifts Pass@1 by 48.92% on the real-world LiveCodeBench with CodeLlama-7B. Similar gains hold for StarCoder2-7B on MBPP and for DeepSeekCoder-6.7B on the Java benchmark SemDiff-Java, demonstrating model- and language-agnostic effectiveness.
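The abstract's decode-time loop (score each partial line, roll back on a flag, regenerate) can be sketched as follows. This is a hypothetical illustration, not the paper's actual API: `generate_line`, `evaluator_score`, and the threshold are illustrative stand-ins.

```python
# Sketch of SemGuard-style line-level supervised decoding (assumed interface).
def guarded_generate(prompt, generate_line, evaluator_score,
                     threshold=0.5, max_lines=50, max_retries=3):
    """Generate code line by line; roll back and resample any line the
    semantic evaluator flags as likely to diverge from the intent."""
    lines = []
    while len(lines) < max_lines:
        for _ in range(max_retries):
            candidate = generate_line(prompt, lines)
            if candidate is None:          # model signalled end of program
                return "\n".join(lines)
            # Score the *partial* program: no execution, no test cases.
            if evaluator_score(prompt, lines + [candidate]) >= threshold:
                lines.append(candidate)
                break
        else:
            # All retries flagged: keep the best-effort line and continue.
            lines.append(candidate)
    return "\n".join(lines)
```

The key design point mirrored here is that supervision happens inside the decoding loop on partial code, rather than after the full program exists.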
Related papers
- SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation [1.0010193170880752]
We introduce a neuro-symbolic, scenario-based framework that bridges the gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. We evaluate it on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE.
arXiv Detail & Related papers (2026-02-18T18:09:03Z) - AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - Can LLMs Compress (and Decompress)? Evaluating Code Understanding and Execution via Invertibility [36.41073880422337]
We present RoundTripCodeEval (RTCE), a comprehensive benchmark consisting of four distinct code execution reasoning tasks. We systematically evaluate state-of-the-art Code-LLMs using zero-shot prompting, supervised fine-tuning on execution traces, and self-reflection mechanisms. RTCE surfaces several new and previously unmeasured insights that are not captured by existing I/O-prediction, execution-reasoning, or round-trip natural-language benchmarks.
arXiv Detail & Related papers (2026-01-19T21:09:48Z) - CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal [84.71254539482369]
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. We present CARE, a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures.
arXiv Detail & Related papers (2025-12-22T16:34:21Z) - Time Travel: LLM-Assisted Semantic Behavior Localization with Git Bisect [8.55768450285885]
We present a novel framework that integrates Large Language Models (LLMs) into the Git bisect process for semantic fault localization. Our system augments bisect traversal with structured chain-of-thought reasoning, enabling commit-by-commit analysis under noisy conditions.
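The bisect traversal this abstract describes can be sketched as a binary search over commit history, with a (hypothetical) `llm_judge` callable standing in for the paper's chain-of-thought commit analysis:

```python
# Illustrative sketch of LLM-guided bisection, not the paper's implementation.
def semantic_bisect(commits, llm_judge):
    """Return the first commit judged 'bad', assuming a single good-to-bad
    transition in the history, as in `git bisect`."""
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if llm_judge(commits[mid]):   # True => behavior still good here
            lo = mid + 1              # fault introduced later
        else:
            hi = mid                  # fault introduced at or before mid
    return commits[lo]
```

In practice the judge would inspect the diff and observed behavior at each commit; the search itself needs only O(log n) judgments.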
arXiv Detail & Related papers (2025-11-24T07:49:59Z) - Towards Automated Error Discovery: A Study in Conversational AI [48.735443116662026]
We introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI. We also propose SEEED (Soft Clustering Extended-Based Error Detection), an encoder-based approach to its implementation.
arXiv Detail & Related papers (2025-09-13T14:53:22Z) - Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset [0.0]
We present ReDef, a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than prior resources.
arXiv Detail & Related papers (2025-09-11T07:07:11Z) - Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches. Our work highlights and leverages an overlooked property of DLMs: early answer convergence. We introduce Prophet, a training-free fast decoding paradigm that enables early-commit decoding.
arXiv Detail & Related papers (2025-08-27T15:40:25Z) - Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs [5.10123605644148]
Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair. Recent studies show that large language models (LLMs) outperform traditional techniques.
arXiv Detail & Related papers (2025-07-28T16:39:16Z) - EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking [58.15568681219339]
We introduce EquiBench, a new benchmark for evaluating large language models (LLMs) via equivalence checking. This task directly tests a model's ability to reason about program semantics. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
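A minimal illustration of the equivalence-checking task (not EquiBench's actual harness): given two programs, decide whether they compute the same function. Random differential testing, shown below, can only refute equivalence; establishing it requires reasoning about program semantics, which is what such a benchmark probes in LLMs.

```python
import random

def likely_equivalent(f, g, gen_input, trials=1000, seed=0):
    """Differential testing on random inputs: refutes equivalence on a
    counterexample, but a pass is evidence, not a proof."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = gen_input(rng)
        if f(x) != g(x):
            return False              # found a distinguishing input
    return True                       # no counterexample found

f = lambda n: n * (n + 1) // 2        # closed-form triangular number
g = lambda n: sum(range(n + 1))       # loop-based equivalent
```

The gap between "no counterexample found" and "provably equivalent" is exactly where semantic reasoning is needed.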
arXiv Detail & Related papers (2025-02-18T02:54:25Z) - A test-free semantic mistakes localization framework in Neural Code Translation [32.5036379897325]
We present EISP, a static analysis framework based on large language models (LLMs).
The framework generates a semantic mapping between source code and translated code.
EISP connects each pair of sub-code fragments with fine-grained knowledge hints through an AI chain.
arXiv Detail & Related papers (2024-10-30T08:53:33Z) - Impact of Large Language Models of Code on Fault Localization [2.936007114555107]
We propose a simple but effective sequence generation approach for fine-tuning large language models of code for FL tasks.
Specifically, we fine-tune 13 representative encoder-, encoder-decoder-, and decoder-based LLMCs for FL tasks.
Experimental results show that LLMCs fine-tuned with our approach successfully pinpoint error positions in 50.6%, 64.2%, and 72.3% of 1,291 methods in Defects4J for Top-2/3/5 prediction.
arXiv Detail & Related papers (2024-08-19T02:36:07Z) - Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
Positional Integrity Encoding (PIE) efficiently updates the model's context when the user edits the code, avoiding full recomputation. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.