Related papers: Aligning Requirement for Large Language Model's Code Generation

Aligning Requirement for Large Language Model's Code Generation

URL: http://arxiv.org/abs/2509.01313v2
Date: Sat, 06 Sep 2025 01:50:34 GMT
Title: Aligning Requirement for Large Language Model's Code Generation
Authors: Zhao Tian, Junjie Chen,
Abstract summary: Specine is a novel specification alignment technique for large language models (LLMs) code generation.<n>Its key idea is to identify misaligned input specifications, lift LLM-perceived specifications, and align them to enhance the code generation performance of LLMs.<n>For example, Specine outperforms the most effective baseline, achieving an average improvement of 29.60% across all subjects in terms of Pass@1.
Score: 9.205909320363247
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code generation refers to the automatic generation of source code based on a given programming specification, which has garnered significant attention particularly with the advancement of large language models (LLMs). However, due to the inherent complexity of real-world problems, the LLM-generated code often fails to fully align with the provided specification. While state-of-the-art agent-based techniques have been proposed to enhance LLM code generation, they overlook the critical issue of specification perception, resulting in persistent misalignment issues. Given that accurate perception of programming specifications serves as the foundation of the LLM-based code generation paradigm, ensuring specification alignment is particularly crucial. In this work, we draw on software requirements engineering to propose Specine, a novel specification alignment technique for LLM code generation. Its key idea is to identify misaligned input specifications, lift LLM-perceived specifications, and align them to enhance the code generation performance of LLMs. Our comprehensive experiments on four state-of-the-art LLMs across five challenging competitive benchmarks by comparing with ten state-of-the-art baselines, demonstrate the effectiveness of Specine. For example, Specine outperforms the most effective baseline, achieving an average improvement of 29.60% across all subjects in terms of Pass@1.

Related papers

Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement [8.059802912761919]
We uncover a systematic failure of large language models (LLMs) in matching code to natural language requirements.<n>More detailed prompt design, particularly with those requiring explanations and proposed corrections, leads to higher misjudgment rates.<n>We propose a Fix-guided Verification Filter that treats the model proposed fix as executable counterfactual evidence.
arXiv Detail & Related papers (2026-02-28T08:35:25Z)
CodeSimpleQA: Scaling Factuality in Code Large Language Models [55.705748501461294]
We present CodeSimpleQA, a comprehensive benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions.<n>We also create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-12-22T14:27:17Z)
LLM Agents for Generating Microservice-based Applications: how complex is your specification? [0.0]
We explore code synthesis for microservice-based applications.<n>We propose a metric for scoring the difficulty of a specification.<n>The higher the score, the more difficult it is to generate code for the specification.
arXiv Detail & Related papers (2025-08-22T12:34:22Z)
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications [0.6813925418351435]
Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks.<n>In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements.<n>Our results reveal that LLMs frequently misclassify correct code implementations as either not satisfying requirements'' or containing potential defects.
arXiv Detail & Related papers (2025-08-17T13:07:26Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Exploring Code Language Models for Automated HLS-based Hardware Generation: Benchmark, Infrastructure and Analysis [14.458529723566379]
Large language models (LLMs) can be employed for programming languages such as Python and C++.<n>This paper explores leveraging LLMs to generate High-Level Synthesis (HLS)-based hardware design.
arXiv Detail & Related papers (2025-02-19T17:53:59Z)
CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.<n>CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details.<n>Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences [5.165576022684194]
We propose using the LLM-as-a-Judge methodology to evaluate the alignment of LLMs with coding preferences.<n>CodeUltraFeedback consists of 10,000 coding instructions, each annotated with four responses generated from a diverse pool of 14 LLMs.<n>In turn, we explore the usage of CodeUltraFeedback as feedback data to fine-tune and align CodeLlama-7B-Instruct using supervised fine-tuning (SFT) and reinforcement learning from AI feedback (RLAIF) with direct preference optimization (DPO)
arXiv Detail & Related papers (2024-03-14T01:51:35Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge. It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages. We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
Fixing Large Language Models' Specification Misunderstanding for Better Code Generation [13.494822086550604]
muFiX is a novel prompting technique to improve the code generation performance of large language models (LLMs)<n>It first exploits test case analysis to obtain specification understanding and enables a self-improvement process.<n>muFiX further fixes the specification understanding towards the direction reducing the gap between the provided understanding and the actual understanding.
arXiv Detail & Related papers (2023-09-28T02:58:07Z)
CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.