Related papers: An Empirical Study on the Effects of System Prompts in Instruction-Tuned Models for Code Generation

An Empirical Study on the Effects of System Prompts in Instruction-Tuned Models for Code Generation

URL: http://arxiv.org/abs/2602.15228v1
Date: Mon, 16 Feb 2026 22:11:21 GMT
Title: An Empirical Study on the Effects of System Prompts in Instruction-Tuned Models for Code Generation
Authors: Zaiyu Cheng, Antonio Mastropaolo,
Abstract summary: We systematically evaluate how system prompts affect code assistant.<n>We find that increasing system-prompt constraint specificity does not monotonically improve correctness.<n>For larger code-specialized models, few-shot examples can degrade performance relative to zero-shot generation.
Score: 4.76360912129794
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Instruction-tuned Language Models (ILMs) have become essential components of modern AI systems, demonstrating exceptional versatility across natural language and reasoning tasks. Among their most impactful applications is code generation, where ILMs -- commonly referred to as Code Language Models (CLMs) -- translate human intent into executable programs. While progress has been driven by advances in scaling and training methodologies, one critical aspect remains underexplored: the impact of system prompts on both general-purpose ILMs and specialized CLMs for code generation. We systematically evaluate how system prompts of varying instructional detail, along with model scale, prompting strategy, and programming language, affect code assistant. Our experimental setting spans 360 configurations across four models, five system prompts, three prompting strategies, two languages, and two temperature settings. We find that (1) increasing system-prompt constraint specificity does not monotonically improve correctness -- prompt effectiveness is configuration-dependent and can help or hinder based on alignment with task requirements and decoding context; (2) for larger code-specialized models, few-shot examples can degrade performance relative to zero-shot generation, contrary to conventional wisdom; and (3) programming language matters, with Java exhibiting significantly greater sensitivity to system prompt variations than Python, suggesting language-specific prompt engineering strategies may be necessary.

Related papers

Not All Tokens Matter: Data-Centric Optimization for Efficient Code Summarization [46.365359894614706]
We evaluate how system prompts affect ILMs and CLMs in code generation tasks.<n>Our evaluation framework, spanning 120 model configurations, reveals that the influence of system prompts increases with model scale.<n>Java shows greater sensitivity to system prompt variations than Python.
arXiv Detail & Related papers (2026-01-28T00:45:28Z)
Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages [61.18573330164572]
System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time.<n>This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior.
arXiv Detail & Related papers (2025-12-02T14:54:54Z)
Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation [63.97051732013936]
We propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases.<n>In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes.<n>In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes.
arXiv Detail & Related papers (2025-07-14T14:34:15Z)
What Makes Large Language Models Reason in (Multi-Turn) Code Generation? [28.614888506962988]
Chain-of-thought has established itself as a popular vehicle for improving the outputs of large language models (LLMs)<n>We investigate the effects of a wide range of prompting strategies with a focus on automatic re-prompting over multiple turns and computational requirements.<n>Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets.
arXiv Detail & Related papers (2024-10-10T16:53:10Z)
CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
CodeGRAG builds the graphical view of code blocks based on the control flow and data flow of them to better interpret the programming domain knowledge.<n>CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
Mixture-of-Instructions: Aligning Large Language Models via Mixture Prompting [7.103987978402038]
We introduce a novel technique termed Mixture-of-Instructions (MoI)<n>MoI employs a strategy of instruction packing combined with diverse system prompts to boost the alignment efficiency of language models.<n>Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI.
arXiv Detail & Related papers (2024-04-29T03:58:12Z)
Human-Centric Autonomous Systems With LLMs for User Command Reasoning [16.452638202694246]
We propose to leverage the reasoning capabilities of Large Language Models to infer system requirements from in-cabin users' commands. We confirm the general ability of LLMs to understand and reason about prompts but underline that their effectiveness is conditioned on the quality of both the LLM model and the design of appropriate sequential prompts.
arXiv Detail & Related papers (2023-11-14T14:42:28Z)
Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks. Our approach is adjustable and flexible in accommodating various instruction modalities and input types. Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z)
PADL: Language-Directed Physics-Based Character Control [66.517142635815]
We present PADL, which allows users to issue natural language commands for specifying high-level tasks and low-level skills that a character should perform. We show that our framework can be applied to effectively direct a simulated humanoid character to perform a diverse array of complex motor skills.
arXiv Detail & Related papers (2023-01-31T18:59:22Z)
Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings. We demonstrate that this framework enables effective generalization across different environments. For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.