SemanticForge: Repository-Level Code Generation through Semantic Knowledge Graphs and Constraint Satisfaction
- URL: http://arxiv.org/abs/2511.07584v1
- Date: Wed, 12 Nov 2025 01:05:36 GMT
- Title: SemanticForge: Repository-Level Code Generation through Semantic Knowledge Graphs and Constraint Satisfaction
- Authors: Wuyang Zhang, Chenkai Zhang, Zhen Luo, Jianming Ma, Wangming Yuan, Chuqiao Gu, Chenwei Feng,
- Abstract summary: Large language models (LLMs) have transformed software development by enabling automated code generation, yet they frequently suffer from systematic errors that limit practical deployment. We identify two critical failure modes: \textit{logical hallucination} (incorrect control/data-flow reasoning) and \textit{schematic hallucination} (type mismatches, signature violations, and architectural inconsistencies). This paper presents \textbf{SemanticForge}, which introduces four fundamental algorithmic advances for semantically-aware code generation.
- Score: 7.46733617565624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have transformed software development by enabling automated code generation, yet they frequently suffer from systematic errors that limit practical deployment. We identify two critical failure modes: \textit{logical hallucination} (incorrect control/data-flow reasoning) and \textit{schematic hallucination} (type mismatches, signature violations, and architectural inconsistencies). These errors stem from the absence of explicit, queryable representations of repository-wide semantics. This paper presents \textbf{SemanticForge}, which introduces four fundamental algorithmic advances for semantically-aware code generation: (1) a novel automatic reconciliation algorithm for dual static-dynamic knowledge graphs, unifying compile-time and runtime program semantics; (2) a neural approach that learns to generate structured graph queries from natural language, achieving 73\% precision versus 51\% for traditional retrieval; (3) a novel beam search algorithm with integrated SMT solving, enabling real-time constraint verification during generation rather than post-hoc validation; and (4) an incremental maintenance algorithm that updates knowledge graphs in $O(|\Delta R| \cdot \log n)$ time while maintaining semantic equivalence.
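To make the third advance concrete, the sketch below shows one way SMT-checked beam search can prune schematically inconsistent candidates during decoding rather than after generation. It is a minimal illustration under stated assumptions, not the paper's implementation: the vocabulary, the stand-in language-model score, and the integer "width" constraints are all hypothetical, and Z3 (via the z3-solver package) stands in for whichever solver SemanticForge actually integrates.

```python
# Hedged sketch: beam search with an SMT solver in the loop (assumed design, not the paper's code).
# Each partial candidate accumulates "schematic" facts; before it is kept in the beam,
# those facts are checked with Z3, and unsatisfiable candidates are pruned mid-generation.
from z3 import Solver, Int, sat


def toy_lm_score(prefix, token):
    """Stand-in for an LLM log-probability; shorter tokens score higher here."""
    return -len(token) - 0.01 * len(prefix)


def constraints_satisfiable(facts):
    """Encode the collected facts as Z3 constraints and check satisfiability."""
    solver = Solver()
    width = Int("width")
    for name, value in facts:
        if name == "width_eq":
            solver.add(width == value)   # e.g. a call site fixes the width to a literal
        elif name == "width_ge":
            solver.add(width >= value)   # e.g. a signature requires a minimum width
    return solver.check() == sat


def constrained_beam_search(vocab, fact_of, beam_width=2, max_len=3):
    beams = [([], 0.0, [])]  # (tokens, score, accumulated facts)
    for _ in range(max_len):
        candidates = []
        for tokens, score, facts in beams:
            for token in vocab:
                new_facts = facts + fact_of.get(token, [])
                if not constraints_satisfiable(new_facts):
                    continue  # prune: schematic inconsistency detected during generation
                candidates.append((tokens + [token],
                                   score + toy_lm_score(tokens, token),
                                   new_facts))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams


if __name__ == "__main__":
    # Hypothetical vocabulary: "call_a" fixes width to 8, "decl_16" demands width >= 16,
    # so any candidate containing both is pruned by the solver.
    vocab = ["call_a", "decl_16", "pad"]
    fact_of = {"call_a": [("width_eq", 8)], "decl_16": [("width_ge", 16)]}
    for tokens, score, _ in constrained_beam_search(vocab, fact_of):
        print(tokens, round(score, 2))
```

The same pattern would extend to the paper's other constraint classes (type and signature consistency) by swapping the toy width facts for predicates extracted from the repository knowledge graph; the design choice illustrated here is simply that satisfiability is checked at every expansion step instead of once on the finished output.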
Related papers
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - GRAD: Graph-Retrieved Adaptive Decoding for Hallucination Mitigation [12.69955054591315]
We introduce Graph-Retrieved Adaptive Decoding (GRAD), a decoding-time method that grounds generation in corpus-derived evidence without retraining. Across three models and a range of question-answering benchmarks, GRAD consistently surpasses baselines. GRAD offers a lightweight, plug-and-play alternative to contrastive decoding and knowledge graph augmentation.
arXiv Detail & Related papers (2025-11-05T22:51:16Z) - Truth-Aware Decoding: A Program-Logic Approach to Factual Language Generation [0.2864713389096699]
This paper introduces Truth-Aware Decoding (TAD), a verification-oriented decoding scheme that aligns neural language generation with knowledge bases. Our contributions are fourfold: (i) a constraint-based semantics that renders oracle filtering as a program-logic judgment, (ii) a proof that greedy selection enjoys local likelihood dominance under sound and complete guards, and (iii) an entropy-style invariant that quantifies factual risk via knowledge-aware safe mass.
arXiv Detail & Related papers (2025-10-03T22:11:15Z) - SLICET5: Static Program Slicing using Language Models with Copy Mechanism and Constrained Decoding [13.61350801915956]
Static program slicing is a fundamental technique in software engineering. SLICET5 is a novel slicing framework that reformulates static program slicing as a sequence-to-sequence task. SLICET5 consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-09-22T03:14:47Z) - Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation [79.75818239774952]
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. We propose Align-GRAG, a novel reasoning-guided dual alignment framework for the post-retrieval phase.
arXiv Detail & Related papers (2025-05-22T05:15:27Z) - DeepRTL: Bridging Verilog Understanding and Generation with a Unified Representation Model [13.532046953850902]
We present DeepRTL, a unified representation model that excels in both Verilog understanding and generation. Based on CodeT5+, DeepRTL is fine-tuned on a comprehensive dataset that aligns Verilog code with rich, multi-level natural language descriptions. We introduce the first benchmark for Verilog understanding and take the initiative to apply embedding similarity and GPT Score to evaluate the models' understanding capabilities.
arXiv Detail & Related papers (2025-02-20T11:07:55Z) - NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition [80.22784377150465]
Handwritten Mathematical Expression Recognition (HMER) has gained considerable attention in pattern recognition for its diverse applications in document understanding.
This paper makes the first attempt to build a novel bottom-up Non-AutoRegressive Modeling approach for HMER, called NAMER.
NAMER comprises a Visual Aware Tokenizer (VAT) and a Parallel Graph Decoder (PGD).
arXiv Detail & Related papers (2024-07-16T04:52:39Z) - Momentum Decoding: Open-ended Text Generation As Graph Exploration [49.812280360794894]
Open-ended text generation with autoregressive language models (LMs) is one of the core tasks in natural language processing.
We formulate open-ended text generation from a new perspective, i.e., we view it as an exploration process within a directed graph.
We propose a novel decoding method -- \textit{momentum decoding} -- which encourages the LM to explore new nodes outside the current graph.
arXiv Detail & Related papers (2022-12-05T11:16:47Z) - Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - Structure-Augmented Text Representation Learning for Efficient Knowledge Graph Completion [53.31911669146451]
Human-curated knowledge graphs provide critical supportive information to various natural language processing tasks.
These graphs are usually incomplete, which motivates their automatic completion.
Graph embedding approaches, e.g., TransE, learn structured knowledge by representing graph elements as dense embeddings.
Textual encoding approaches, e.g., KG-BERT, resort to graph triples' text and triple-level contextualized representations.
arXiv Detail & Related papers (2020-04-30T13:50:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.