Not All Tokens Matter: Data-Centric Optimization for Efficient Code Summarization
- URL: http://arxiv.org/abs/2601.20147v1
- Date: Wed, 28 Jan 2026 00:45:28 GMT
- Title: Not All Tokens Matter: Data-Centric Optimization for Efficient Code Summarization
- Authors: Saima Afrin, Zaiyu Cheng, Tushar Sharma, Alexander Serebrenik, Massimiliano Di Penta, Antonio Mastropaolo
- Abstract summary: We evaluate how system prompts affect ILMs and CLMs in code generation tasks. Our evaluation framework, spanning 120 model configurations, reveals that the influence of system prompts increases with model scale. Java shows greater sensitivity to system prompt variations than Python.
- Score: 46.365359894614706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-tuned Language Models (ILMs) have become essential components of modern AI systems, demonstrating exceptional versatility across a wide range of natural language and reasoning tasks. Among their most impactful applications is code generation, where ILMs--commonly referred to as Code Language Models (CLMs)--have demonstrated remarkable capability. This strength stems from their defining feature: the use of explicit task instructions during fine-tuning, which enables them to bridge natural language and code by translating human intent into executable code. While much of their progress has been driven by advances in scaling laws and training methodologies, one critical aspect remains underexplored--the impact of system prompts on the performance of both general-purpose ILMs and specialized CLMs when instantiated to assist users with code generation activities. In this study, we take a first step toward bridging this gap by systematically evaluating how system prompts of varying instructional detail, along with model scale, prompting strategy, and programming language, affect ILMs and CLMs in code generation tasks. Our evaluation framework, spanning 120 model configurations, reveals that (1) the influence of system prompts increases with model scale; (2) few-shot prompting reduces this effect compared to zero-shot; and (3) programming language matters, with Java showing greater sensitivity to system prompt variations than Python.
Related papers
- Code Fingerprints: Disentangled Attribution of LLM-Generated Code [7.515488307576106]
We study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code. We propose the Disentangled Code Attribution Network (DCAN), which separates Source-Agnostic semantic information from Source-Specific stylistic representations. We construct the first large-scale benchmark dataset comprising code generated by four widely used Large Language Models (LLMs) across four programming languages.
arXiv Detail & Related papers (2026-03-04T15:58:36Z)
- An Empirical Study on the Effects of System Prompts in Instruction-Tuned Models for Code Generation [4.76360912129794]
We systematically evaluate how system prompts affect code assistants. We find that increasing system-prompt constraint specificity does not monotonically improve correctness. For larger code-specialized models, few-shot examples can degrade performance relative to zero-shot generation.
arXiv Detail & Related papers (2026-02-16T22:11:21Z)
- On Code-Induced Reasoning in LLMs [21.875805779552564]
We construct parallel instruction datasets in ten programming languages. We apply controlled perturbations that selectively disrupt structural or semantic properties of code. Across 3,331 experiments, our results show that LLMs are more vulnerable to structural perturbations than semantic ones.
arXiv Detail & Related papers (2025-09-25T19:57:36Z)
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- Large Language Models are Interpretable Learners [53.56735770834617]
In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge the gap between expressiveness and interpretability.
The pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts.
As the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable) and other LLMs.
arXiv Detail & Related papers (2024-06-25T02:18:15Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to better interpret programming domain knowledge. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- Testing LLMs on Code Generation with Varying Levels of Prompt Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing.
The potential to transform natural language prompts into executable code promises a major shift in software development practices.
arXiv Detail & Related papers (2023-11-10T23:41:41Z)
- From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes.
The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models.
Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)