Code vs Serialized AST Inputs for LLM-Based Code Summarization: An Empirical Study
- URL: http://arxiv.org/abs/2602.06671v1
- Date: Fri, 06 Feb 2026 12:55:01 GMT
- Title: Code vs Serialized AST Inputs for LLM-Based Code Summarization: An Empirical Study
- Authors: Shijia Dong, Haoruo Zhao, Paul Harvey,
- Abstract summary: This paper presents AST(NIT), an AST augmentation and serialization method that preserves lexical details and encodes structural information into LLM-compatible sequences. Experiments with the LLaMA-3.1-8B model on the CodeXGLUE Python dataset show that the proposed serialized ASTs reduce the length of LLM inputs, require shorter training times, and achieve summarization quality comparable to existing approaches.
- Score: 0.9558392439655014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Summarizing source code into natural language descriptions (code summarization) helps developers better understand program functionality and reduce the burden of software maintenance. Abstract Syntax Trees (ASTs), as opposed to source code, have been shown to improve summarization quality in traditional encoder-decoder-based code summarization models. However, most large language model (LLM)-based code summarization methods rely on raw code or only incorporate partial AST signals, meaning that the potential of complete AST representation has not been fully explored for LLMs. This paper presents AST(NIT), an AST augmentation and serialization method that preserves lexical details and encodes structural information into LLM-compatible sequences. Experiments with the LLaMA-3.1-8B model on the CodeXGLUE Python dataset show that the proposed serialized ASTs reduce the length of LLM inputs, require shorter training times, and achieve summarization quality comparable to existing approaches.
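To make the idea of serializing an AST while preserving lexical detail concrete, here is a minimal Python sketch built on the standard ast module. It is an illustrative stand-in, not the AST(NIT) scheme from the paper: the node labels, the way identifiers and constants are attached, and the bracketing are all my own choices.

```python
import ast

def serialize_ast(source: str) -> str:
    """Pre-order traversal that flattens an AST into a bracketed token
    sequence, keeping node types plus lexical details (identifiers,
    constants) so they are not lost during serialization.
    Illustrative only; not the paper's AST(NIT) method."""
    def walk(node: ast.AST) -> str:
        label = type(node).__name__
        # Attach lexical information to the structural label.
        if isinstance(node, ast.Name):
            label += f":{node.id}"
        elif isinstance(node, ast.Constant):
            label += f":{node.value!r}"
        elif isinstance(node, (ast.FunctionDef, ast.arg)):
            label += f":{getattr(node, 'name', getattr(node, 'arg', ''))}"
        children = [walk(child) for child in ast.iter_child_nodes(node)]
        return f"({label} {' '.join(children)})" if children else f"({label})"
    return walk(ast.parse(source))

print(serialize_ast("def add(a, b):\n    return a + b"))
```

For this toy input the output is a single flat sequence in which the identifiers add, a, and b survive alongside the structural node types, which is the property the abstract highlights: lexical details and structure combined in one LLM-compatible string.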
Related papers
- Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond [55.984684518346924]
We recast Knowledge Tracing as an inverse problem: learning the minimum natural-language summary that makes past answers explainable and future answers predictable. Our Language Bottleneck Model (LBM) consists of an encoder LLM that writes an interpretable knowledge summary and a frozen decoder LLM that must reconstruct and predict student responses using only that summary text. Experiments on synthetic arithmetic benchmarks and the large-scale Eedi dataset show that LBMs rival the accuracy of state-of-the-art KT and direct LLM methods while requiring orders-of-magnitude fewer student trajectories.
arXiv Detail & Related papers (2025-06-20T13:21:14Z) - Post-Incorporating Code Structural Knowledge into Pretrained Models via ICL for Code Translation [15.73070332657629]
Large language models (LLMs) have achieved significant advancements in software mining. However, handling the syntactic structure of source code remains a challenge. This paper employs in-context learning (ICL) to integrate code structural knowledge into pre-trained LLMs.
arXiv Detail & Related papers (2025-03-28T10:59:42Z) - Self-Explained Keywords Empower Large Language Models for Code Generation [5.236633572296712]
Large language models (LLMs) have achieved impressive performance in code generation.
SEK (Self-Explained Keywords) extracts and explains the key terms in the problem description with the LLM itself.
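As a loose illustration of this extract-then-explain prompting pattern (not the prompts or pipeline from the SEK paper), the sketch below queries the model twice and then assembles an enriched prompt; call_llm is a hypothetical placeholder for any text-completion API.

```python
def self_explained_keywords(problem: str, call_llm) -> str:
    """Two-stage prompting sketch: extract key terms, have the LLM explain
    them, then prepend the explanations to the code-generation prompt.
    `call_llm` is a hypothetical callable (prompt -> completion string)."""
    keywords = call_llm(
        "List the key terms in this programming problem, one per line:\n" + problem
    )
    explanations = call_llm(
        "Explain each of these terms in the context of the problem:\n"
        + keywords + "\n\nProblem:\n" + problem
    )
    return f"{problem}\n\nKey terms explained:\n{explanations}\n\nNow write the code."
```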
arXiv Detail & Related papers (2024-10-21T12:52:03Z) - Source Code Summarization in the Era of Large Language Models [34.80939934434718]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks. In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z) - InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct [43.7550233177368]
This paper explores whether it is possible to use a fine-tuned open-source model to generate additional data to augment its instruction-tuning dataset. We propose Inverse-Instruct, a data augmentation technique that uses a fine-tuned LLM to generate additional instructions from the code responses in its own training dataset.
arXiv Detail & Related papers (2024-07-08T08:00:05Z) - Text-like Encoding of Collaborative Information in Large Language Models for Recommendation [58.87865271693269]
We introduce BinLLM, a novel method to seamlessly integrate collaborative information with Large Language Models for Recommendation (LLMRec).
BinLLM converts collaborative embeddings from external models into binary sequences.
BinLLM provides options to compress the binary sequence using dot-decimal notation to avoid excessively long lengths.
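A toy sketch of the binarize-then-compress idea just described (my own minimal encoding, not BinLLM's implementation): each embedding dimension is thresholded to a bit, and every eight bits are packed into a decimal number joined by dots, keeping the textual sequence short.

```python
def to_dot_decimal(embedding, threshold=0.0):
    """Binarize an embedding and compress the bit string into dot-decimal
    notation (one decimal number per byte). Illustrative sketch only."""
    bits = "".join("1" if x > threshold else "0" for x in embedding)
    bits = bits.ljust((len(bits) + 7) // 8 * 8, "0")  # pad to a byte boundary
    return ".".join(
        str(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8)
    )

emb = [0.3, -1.2, 0.7, 0.0, 2.1, -0.4, 0.9, 1.5,
       -0.2, 0.1, -0.8, 0.6, 1.1, -0.3, 0.2, -1.0]
print(to_dot_decimal(emb))  # '171.90' -- 16 bits become two dotted decimals
```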
arXiv Detail & Related papers (2024-06-05T12:45:25Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on executed code by masking the unexecuted code segments, providing Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
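To isolate the masking idea behind FGO, here is a hedged PyTorch sketch in which tokens from code lines that never executed contribute zero loss; the tensors and the coverage mask are hypothetical, and StepCoder's actual reinforcement-learning objective is more involved than this plain masked cross-entropy.

```python
import torch
import torch.nn.functional as F

def masked_loss(logits, targets, executed_mask):
    """Per-token cross-entropy where tokens of unexecuted code segments
    (mask value 0) are excluded from the optimization signal."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (per_token * executed_mask).sum() / executed_mask.sum().clamp(min=1)

logits = torch.randn(6, 100)                       # 6 tokens, vocab of 100 (toy sizes)
targets = torch.randint(0, 100, (6,))
executed = torch.tensor([1., 1., 0., 0., 1., 1.])  # middle tokens never ran
print(masked_loss(logits, targets, executed))
```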
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z) - Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition [23.172469312225694]
We propose to utilize an instruction-tuned large language model (LLM) for guiding the text generation process in automatic speech recognition (ASR). The proposed model is built on the joint CTC and attention architecture, with the LLM serving as a front-end feature extractor for the decoder. Experimental results show that the proposed LLM-guided model achieves a relative gain of approximately 13% in word error rates across major benchmarks.
arXiv Detail & Related papers (2023-09-19T11:10:50Z) - AST-MHSA : Code Summarization using Multi-Head Self-Attention [1.588193964339148]
We present a model, AST-MHSA, that uses multi-head attention to extract semantic information from the abstract syntax tree (AST) of the code.
The model is trained on a dataset of code and summaries, and the parameters are optimized to minimize the loss between the generated summaries and the ground-truth summaries.
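A bare-bones sketch of applying multi-head self-attention to embeddings of serialized AST tokens with PyTorch's built-in module; the vocabulary size, embedding dimension, and head count here are placeholders, not the configuration reported for AST-MHSA.

```python
import torch
import torch.nn as nn

ast_tokens = torch.randint(0, 5000, (1, 64))   # one serialized AST, 64 token ids (toy)
embed = nn.Embedding(5000, 256)
mhsa = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)

x = embed(ast_tokens)                           # (batch, seq, dim)
out, attn = mhsa(x, x, x)                       # self-attention over the AST sequence
print(out.shape, attn.shape)                    # (1, 64, 256) and (1, 64, 64)
```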
arXiv Detail & Related papers (2023-08-10T15:43:46Z) - Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references.
It is motivated by the observation that there are abundant identical text spans between an LLM's decoding output and a reference that is available in many real-world scenarios.
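The toy sketch below captures that observation: when the most recently generated tokens also appear in the reference, the tokens that follow them in the reference are proposed as draft continuations for the LLM to verify in a single pass. LLMA's actual matching and verification procedure differs in detail; names and parameters here are illustrative.

```python
def propose_from_reference(generated, reference, n=4, copy_len=8):
    """If the last n generated tokens occur in the reference, return the
    next copy_len reference tokens as draft continuations. Sketch only."""
    if len(generated) < n:
        return []
    suffix = generated[-n:]
    for i in range(len(reference) - n):
        if reference[i:i + n] == suffix:
            return reference[i + n:i + n + copy_len]
    return []

gen = "the model was trained on the CodeXGLUE Python".split()
ref = "the model was trained on the CodeXGLUE Python dataset for ten epochs".split()
print(propose_from_reference(gen, ref))  # ['dataset', 'for', 'ten', 'epochs']
```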
arXiv Detail & Related papers (2023-04-10T09:55:14Z)