SLICET5: Static Program Slicing using Language Models with Copy Mechanism and Constrained Decoding
- URL: http://arxiv.org/abs/2509.17338v1
- Date: Mon, 22 Sep 2025 03:14:47 GMT
- Title: SLICET5: Static Program Slicing using Language Models with Copy Mechanism and Constrained Decoding
- Authors: Pengfei He, Shaowei Wang, Tse-Hsun Chen,
- Abstract summary: Static program slicing is a fundamental technique in software engineering.<n>ourtool is a novel slicing framework that reformulates static program slicing as a sequence-to-sequence task.<n>ourtool consistently outperforms state-of-the-art baselines.
- Score: 13.61350801915956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Static program slicing is a fundamental technique in software engineering. Traditional static slicing tools rely on parsing complete source code, which limits their applicability to real-world scenarios where code snippets are incomplete or unparsable. While recent research developed learning-based approaches to predict slices, they face critical challenges: (1) Inaccurate dependency identification, where models fail to precisely capture data and control dependencies between code elements; and (2) Unconstrained generation, where models produce slices with extraneous or hallucinated tokens not present in the input, violating the structural integrity of slices. To address these challenges, we propose \ourtool, a novel slicing framework that reformulates static program slicing as a sequence-to-sequence task using lightweight language models (e.g., CodeT5+). Our approach incorporates two key innovations. First, we introduce a copy mechanism that enables the model to more accurately capture inter-element dependencies and directly copy relevant tokens from the input, improving both dependency reasoning and generation constraint. Second, we design a constrained decoding process with (a) lexical constraint, restricting outputs to input tokens only, and (b) syntactic constraint, leveraging Tree Similarity of Edit Distance (TSED) monotonicity to detect structurally invalid outputs and discard them. We evaluate \ourtool on CodeNet and LeetCode datasets and show it consistently outperforms state-of-the-art baselines, improving ExactMatch scores by up to 27\%. Furthermore, \ourtool demonstrates strong performance on incomplete code, highlighting its robustness and practical utility in real-world development environments.
Related papers
- Readability-Robust Code Summarization via Meta Curriculum Learning [53.44612630063336]
In the real world, code is often poorly structured or obfuscated, significantly degrading model performance.<n>We propose RoFTCodeSum, a novel fine-tuning method that enhances the robustness of code summarization against poorly readable code.
arXiv Detail & Related papers (2026-01-09T02:38:24Z) - zkStruDul: Programming zkSNARKs with Structural Duality [0.2449909275410287]
zkStruDul is a language that unifies input transformations and predicate definitions into a single combined abstraction.<n>We provide a source-level semantics and prove its behavior is identical to the projected semantics, allowing straightforward standard reasoning.
arXiv Detail & Related papers (2025-11-13T18:06:21Z) - SemanticForge: Repository-Level Code Generation through Semantic Knowledge Graphs and Constraint Satisfaction [7.46733617565624]
Large language models (LLMs) have transformed software development by enabling automated code generation, yet they frequently suffer from systematic errors that limit practical deployment.<n>We identify two critical failure modes: textitlogical hallucination (incorrect control/data-flow reasoning) and textitschematic hallucination (type mismatches, signature violations, and architectural inconsistencies).<n>This paper presents textbfSemanticForge, which introduces four fundamental algorithmic advances for semantically-aware code generation.
arXiv Detail & Related papers (2025-11-10T19:53:23Z) - Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution.<n>We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z) - SLICEMATE: Accurate and Scalable Static Program Slicing via LLM-Powered Agents [11.069304685402642]
SliceMate is a novel static program slicing solution powered by Large Language Model (LLM) agents.<n>It bypasses the need for explicit dependency graph construction and achieves superior slicing accuracy.<n>For rigorous evaluation, we construct a new and high-quality benchmark, SliceBench, comprising 2,200 manually annotated Java and Python programs.
arXiv Detail & Related papers (2025-07-25T04:51:47Z) - Combining Constrained and Unconstrained Decoding via Boosting: BoostCD and Its Application to Information Extraction [11.996681571362744]
Boosted Constrained Decoding combines constrained and unconstrained decoding in two phases.<n>We demonstrate the power of BoostCD by applying it to closed information extraction.
arXiv Detail & Related papers (2025-06-17T18:16:17Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - Neural Program Repair with Program Dependence Analysis and Effective
Filter Mechanism [37.70518599085677]
We present a novel neural program repair framework called approach, which adapts the general pre-trained language model for fixing single-line Java bugs.
We make the first attempt to use program slicing to extract contextual information directly related to the given buggy statement as repair ingredients from the corresponding program dependence graph.
We demonstrate the effectiveness of approach on five benchmarks when compared with state-of-the-art baselines.
arXiv Detail & Related papers (2023-05-16T09:43:04Z) - A Syntax-Guided Multi-Task Learning Approach for Turducken-Style Code
Generation [19.489202790935902]
We propose a syntax-guided multi-task learning approach TurduckenGen.
Specifically, we first explicitly append the type information to the code tokens to capture the representation of syntactic constraints.
Then we formalize code generation with syntactic constraint representation as an auxiliary task to enable the model to learn the syntactic constraints of the code.
arXiv Detail & Related papers (2023-03-09T06:22:07Z) - Precise Learning of Source Code Contextual Semantics via Hierarchical
Dependence Structure and Graph Attention Networks [28.212889828892664]
We propose a novel source code model embedded with hierarchical dependencies.
We introduce the syntactic structural of the basic block, i.e., its corresponding AST, in source code model to provide sufficient information.
The results show that our model reduces the scale of parameters by 50% and achieves 4% improvement on accuracy on program classification task.
arXiv Detail & Related papers (2021-11-20T04:03:42Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - POINTER: Constrained Progressive Text Generation via Insertion-based
Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.