TreeDiff: AST-Guided Code Generation with Diffusion LLMs
- URL: http://arxiv.org/abs/2508.01473v2
- Date: Thu, 07 Aug 2025 17:46:00 GMT
- Title: TreeDiff: AST-Guided Code Generation with Diffusion LLMs
- Authors: Yiming Zeng, Jinghan Cao, Zexin Li, Yiming Chen, Tao Ren, Dawei Xiang, Xidong Wu, Shangqian Gao, Tingting Yu
- Abstract summary: We propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns.
- Score: 27.111814602726227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion-based language models have opened new possibilities for controllable and bidirectional sequence generation. These models provide an alternative to traditional autoregressive approaches by framing text generation as an iterative denoising process. However, applying diffusion models to structured domains such as source code remains a significant challenge. Programming languages differ from natural language in that they follow strict syntactic and semantic rules, with hierarchical organization that must be preserved for correctness. Standard token-level corruption techniques used during training often ignore this structure, which may hinder the model's ability to learn meaningful representations of code. To address this limitation, we propose a syntax-aware diffusion framework that incorporates structural priors from Abstract Syntax Trees (ASTs) into the denoising process. Instead of masking individual tokens at random, we selectively corrupt syntactically meaningful code spans derived from AST subtrees. This enables the model to reconstruct programs in a way that respects grammatical boundaries and captures long-range dependencies. Experimental results demonstrate that syntax-aware corruption significantly improves syntactic correctness, reconstruction accuracy, and generalization to unseen code patterns. These findings highlight the potential of incorporating structural information into diffusion-based training and suggest that syntax-guided denoising is a promising direction for advancing diffusion-based language models in code generation tasks.
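To make the span-corruption step concrete, here is a minimal sketch of masking an AST subtree in Python source with the standard `ast` module. The `<mask>` token, uniform subtree sampling, and character-level splicing are assumptions for illustration; the abstract does not specify the paper's tokenizer or masking schedule.

```python
# A minimal sketch of AST-guided span corruption, assuming Python source,
# a single <mask> token per span, and uniform subtree sampling; the paper's
# actual tokenizer and masking schedule are not specified in the abstract.
import ast
import random

MASK = "<mask>"

def ast_span_corrupt(source: str, rng: random.Random) -> str:
    """Replace the source span of one randomly chosen AST subtree with MASK."""
    tree = ast.parse(source)
    # Keep nodes that carry precise end positions (Python 3.8+); helper
    # nodes without locations (e.g. expression contexts) are filtered out.
    candidates = [n for n in ast.walk(tree)
                  if getattr(n, "end_lineno", None) is not None]
    node = rng.choice(candidates)
    # Convert (line, column) positions to flat string offsets.
    # Note: col_offset counts UTF-8 bytes, so ASCII source is assumed here.
    line_starts, pos = [], 0
    for line in source.splitlines(keepends=True):
        line_starts.append(pos)
        pos += len(line)
    begin = line_starts[node.lineno - 1] + node.col_offset
    end = line_starts[node.end_lineno - 1] + node.end_col_offset
    return source[:begin] + MASK + source[end:]

print(ast_span_corrupt("def add(a, b):\n    return a + b\n", random.Random(0)))
# masks a grammatically complete span, e.g. "def add(a, b):\n    <mask>\n"
```

Because the corrupted span is a complete grammatical unit (an expression, a statement, a function body), the denoiser must regenerate code that closes its own brackets and respects grammatical boundaries, which is the intuition behind the reported gains in syntactic correctness.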
Related papers
- Unveiling the Potential of Diffusion Large Language Model in Controllable Generation [11.181783720439563]
Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs (dLLMs). We propose Self-adaptive Schema Scaffolding, a novel framework that enables dLLMs to generate structured outputs while maintaining semantic fidelity and accelerating inference.
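The scaffolding idea lends itself to a small illustration: fix the structural tokens of the output up front and leave only value slots for the model to denoise. This is a hypothetical sketch of that idea, not the framework's actual interface; `scaffold` and the `<mask>` token are illustrative assumptions.

```python
# Hypothetical sketch of schema scaffolding: the JSON structure is fixed in
# advance and only the <mask> value slots are left for a diffusion LM to
# denoise in place, so every sample is structurally valid by construction.
MASK = "<mask>"

def scaffold(schema_keys: list[str]) -> str:
    """Build a JSON skeleton whose values are masked denoising targets."""
    body = ", ".join(f'"{key}": {MASK}' for key in schema_keys)
    return "{" + body + "}"

print(scaffold(["title", "year"]))  # {"title": <mask>, "year": <mask>}
```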
arXiv Detail & Related papers (2025-07-06T18:41:34Z)
- Collapsed Language Models Promote Fairness [88.48232731113306]
We find that debiased language models exhibit collapsed alignment between token representations and word embeddings.
We design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods.
arXiv Detail & Related papers (2024-10-06T13:09:48Z)
- Reconsidering Degeneration of Token Embeddings with Definitions for Encoder-based Pre-trained Language Models [20.107727903240065]
We propose DefinitionEMB to re-construct isotropically distributed and semantics-related token embeddings for encoder-based language models.
Our experiments demonstrate the effectiveness of leveraging definitions from Wiktionary to re-construct such embeddings.
arXiv Detail & Related papers (2024-08-02T15:00:05Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to better interpret programming domain knowledge.
CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- A Cheaper and Better Diffusion Language Model with Soft-Masked Noise [62.719656543880596]
Masked-Diffuse LM is a novel diffusion model for language modeling, inspired by linguistic features.
Specifically, we design a linguistically informed forward process that adds corruptions to the text through strategic soft-masking to better noise the textual data.
We demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.
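As a hedged sketch of what a soft-masked forward step could look like, assuming that per-token importance scores decide how early a token is noised and that a soft mask is a linear blend with the [MASK] embedding (the paper derives its own schedule from linguistic features, which this does not reproduce):

```python
# Hedged sketch of soft-masked forward corruption. Assumptions: importance
# scores in (0, 1] control how early each token is noised, and "soft"
# masking linearly blends token embeddings with the [MASK] embedding.
import torch

def soft_mask_forward(tok_emb: torch.Tensor,    # (L, D) clean embeddings
                      mask_emb: torch.Tensor,   # (D,) [MASK] embedding
                      importance: torch.Tensor, # (L,) scores in (0, 1]
                      t: int, T: int) -> torch.Tensor:
    # Here a token is fully masked once t reaches importance * T, i.e.
    # lower-importance tokens are corrupted earlier in the forward process
    # (an assumed ordering; the paper's linguistic schedule may differ).
    w = torch.clamp(t / (importance * T), max=1.0)  # (L,) blend weights
    return (1.0 - w).unsqueeze(1) * tok_emb + w.unsqueeze(1) * mask_emb
```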
arXiv Detail & Related papers (2023-04-10T17:58:42Z)
- GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator [114.8954615026781]
We propose a GAN-style model for encoder-decoder pre-training by introducing an auxiliary discriminator.
GanLM is trained with two pre-training objectives: replaced token detection and replaced token denoising.
Experiments on language generation benchmarks show that GanLM, with its powerful language understanding capability, outperforms various strong pre-trained language models.
arXiv Detail & Related papers (2022-12-20T12:51:11Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves the new state-of-the-art on all the structured prediction tasks we looked at.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- PSSAT: A Perturbed Semantic Structure Awareness Transferring Method for Perturbation-Robust Slot Filling [27.602336774468]
Most existing slot filling models tend to memorize inherent patterns of entities and corresponding contexts from training data.
We propose a perturbed semantic structure awareness transferring method for training perturbation-robust slot filling models.
arXiv Detail & Related papers (2022-08-24T13:01:00Z)
- CDLNet: Robust and Interpretable Denoising Through Deep Convolutional Dictionary Learning [6.6234935958112295]
Unrolled optimization networks offer an interpretable alternative to constructing deep neural networks.
We show that the proposed model outperforms state-of-the-art denoising models when scaled to a similar parameter count.
arXiv Detail & Related papers (2021-03-05T01:15:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.