In-Context Learning as an Effective Estimator of Functional Correctness of LLM-Generated Code
- URL: http://arxiv.org/abs/2507.05200v1
- Date: Mon, 07 Jul 2025 17:01:17 GMT
- Title: In-Context Learning as an Effective Estimator of Functional Correctness of LLM-Generated Code
- Authors: Susmita Das, Madhusudan Ghosh, Priyanka Swami, Debasis Ganguly, Gul Calikli
- Abstract summary: We propose an in-context learning (ICL) based approach for code quality estimation. Our findings demonstrate that providing few-shot examples of functionally correct code from a training set enhances the performance of existing QPP approaches.
- Score: 8.40207342119367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When applying LLM-based code generation to software development projects that follow a feature-driven or rapid application development approach, it becomes necessary to estimate the functional correctness of the generated code in the absence of test cases. Just as a user selects a relevant document from a ranked list of retrieved ones, a software generation workflow requires a developer to choose (and potentially refine) a generated solution from a ranked list of alternative solutions, ordered by their posterior likelihoods. This implies that estimating the quality of a ranked list -- akin to estimating "relevance" for query performance prediction (QPP) in IR -- is also crucial for generative software development, where quality is defined in terms of "functional correctness". In this paper, we propose an in-context learning (ICL) based approach for code quality estimation. Our findings demonstrate that providing few-shot examples of functionally correct code from a training set enhances the performance of existing QPP approaches as well as a zero-shot-based approach for code quality estimation.
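As a concrete illustration of the idea, the sketch below assembles a few-shot prompt from functionally correct training examples and asks an LLM to score a candidate solution. The prompt template, the `Exemplar` pairing, and the abstracted `llm` callable are illustrative assumptions, not the paper's exact setup.

```python
from typing import Callable, Sequence, Tuple

# One few-shot exemplar: a task description paired with code known to be
# functionally correct (e.g., it passes the training set's test cases).
Exemplar = Tuple[str, str]

def build_icl_quality_prompt(exemplars: Sequence[Exemplar],
                             task: str, candidate_code: str) -> str:
    """Assemble a few-shot prompt asking an LLM to estimate the functional
    correctness of candidate_code; the wording here is illustrative."""
    parts = ["Rate how likely each solution is to be functionally correct (0 to 1).\n"]
    for desc, code in exemplars:
        parts.append(f"Task: {desc}\nSolution:\n{code}\nCorrectness: 1.0\n")
    parts.append(f"Task: {task}\nSolution:\n{candidate_code}\nCorrectness:")
    return "\n".join(parts)

def estimate_correctness(llm: Callable[[str], str],
                         exemplars: Sequence[Exemplar],
                         task: str, candidate_code: str) -> float:
    """Parse the model's numeric reply into a score in [0, 1]."""
    reply = llm(build_icl_quality_prompt(exemplars, task, candidate_code))
    try:
        return max(0.0, min(1.0, float(reply.strip().split()[0])))
    except (ValueError, IndexError):
        return 0.0  # treat an unparsable reply as lowest confidence
```

Scores obtained this way can rank a list of alternative generations, mirroring how QPP estimates rank retrieved documents in IR.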
Related papers
- CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval [31.817325318218003]
CoQuIR is the first large-scale, multilingual benchmark designed to evaluate quality-aware code retrieval. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages.
arXiv Detail & Related papers (2025-05-31T13:00:17Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
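The summary does not spell out REAL's reward design; the sketch below illustrates the general shape of such a signal, combining unit-test outcomes with static-analysis findings into one scalar RL reward. The function name and weights are assumptions for illustration.

```python
def quality_reward(tests_passed: int, tests_total: int,
                   analysis_warnings: int,
                   w_tests: float = 1.0, w_analysis: float = 0.1) -> float:
    """Scalar reward for RL fine-tuning: reward functional correctness,
    penalize each static-analysis finding. Weights are illustrative and
    REAL's actual formulation may differ."""
    if tests_total == 0:
        return 0.0
    return w_tests * (tests_passed / tests_total) - w_analysis * analysis_warnings

# Example: 8/10 tests pass and the analyzer reports 3 issues -> about 0.5
print(quality_reward(8, 10, 3))
```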
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- Structure-Aware Corpus Construction and User-Perception-Aligned Metrics for Large-Language-Model Code Completion [5.771285831097908]
We propose two evaluation metrics for code completion tasks: LCP and ROUGE-LCP. We also propose a data processing method based on a Structure-Preserving and Semantically-Reordered Code Graph.
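The summary does not define LCP formally; a natural reading is a longest-common-prefix ratio between the generated completion and the reference, sketched below under that assumption.

```python
def lcp_score(generated: str, reference: str) -> float:
    """Length of the longest common prefix, normalized by reference length.
    The user-perception intuition: a completion that diverges early is of
    little use even if it overlaps with the reference later on."""
    if not reference:
        return 0.0
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n / len(reference)

print(lcp_score("for i in range(10):", "for i in range(n):"))  # ~0.83
```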
arXiv Detail & Related papers (2025-05-19T13:09:32Z)
- Causally Aligned Curriculum Learning [69.11672390876763]
This paper studies the problem of curriculum RL through a causal lens. We derive a sufficient graphical condition characterizing causally aligned source tasks and develop an efficient algorithm to generate a causally aligned curriculum.
arXiv Detail & Related papers (2025-03-21T02:20:38Z)
- Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study investigating GPT-4's effectiveness in recommending and suggesting idiomatic actions. Our findings underscore the potential of LLMs to achieve tasks that previously required recommenders built on complex code analyses.
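For concreteness, this is the kind of non-idiomatic-to-idiomatic rewrite such a recommender targets; the example below is generic Python, not one taken from the study.

```python
values = [1, 2, 3, 4]

# Non-idiomatic: index-based iteration and manual accumulation
squares = []
for i in range(len(values)):
    squares.append(values[i] ** 2)

# Idiomatic: a list comprehension over the sequence itself
squares = [v ** 2 for v in values]
```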
arXiv Detail & Related papers (2025-01-28T15:41:54Z)
- Generating refactored code accurately using reinforcement learning [3.179831861897336]
We propose a novel reinforcement learning-based approach for fine-tuning and aligning code language models to perform automated, intelligent extract-method refactoring on Java source code. Our approach fine-tunes sequence-to-sequence generative models and aligns them using the Proximal Policy Optimization (PPO) algorithm. Our experiments demonstrate that our approach significantly enhances the performance of large language models in code refactoring.
arXiv Detail & Related papers (2024-12-23T23:09:48Z)
- CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
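A hedged sketch of one way to realize self-generation-and-validation: score candidate programs by the self-generated tests they pass and score tests by the candidates that pass them, reinforcing the two iteratively. The update rule and damping constant below are illustrative, not necessarily CodeDPO's exact procedure.

```python
import numpy as np

def mutual_scores(pass_matrix: np.ndarray, iters: int = 10, d: float = 0.85):
    """pass_matrix[i, j] = 1.0 if candidate i passes self-generated test j.
    Mutual reinforcement: good code passes good tests; good tests are
    passed by good code. Returns per-candidate and per-test scores."""
    n_code, n_test = pass_matrix.shape
    code = np.ones(n_code) / n_code
    test = np.ones(n_test) / n_test
    for _ in range(iters):
        code = (1 - d) / n_code + d * pass_matrix @ test
        test = (1 - d) / n_test + d * pass_matrix.T @ code
        code, test = code / code.sum(), test / test.sum()
    return code, test  # rank candidates by `code` to build preference pairs
```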
arXiv Detail & Related papers (2024-10-08T01:36:15Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
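A minimal sketch of the cross-verification idea: execute the Program-of-Thought solution and check that its result agrees with the final answer extracted from the Chain-of-Thought text. The answer-extraction heuristic and the `answer` variable convention are illustrative assumptions.

```python
import re

def verify_cot_with_pot(cot_text: str, pot_program: str) -> bool:
    """Run the PoT program (assumed to assign its result to `answer`) and
    compare it with the last number mentioned in the CoT reasoning."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cot_text)
    if not numbers:
        return False
    scope: dict = {}
    exec(pot_program, scope)  # the sketch assumes trusted, sandboxed code
    return abs(float(numbers[-1]) - float(scope["answer"])) < 1e-6

cot = "Each box holds 12 eggs, so 3 boxes hold 36 eggs. The answer is 36."
print(verify_cot_with_pot(cot, "answer = 12 * 3"))  # True
```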
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification (UQ) is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines, and we conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
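The summary does not list the baselines; one of the simplest sequence-level UQ scores, mean token log-probability, is shown below as a representative example of what such a benchmark implements.

```python
import math
from typing import Sequence

def mean_logprob_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a generated sequence, a standard
    information-based UQ baseline (higher means more confident)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

print(mean_logprob_confidence([-0.1, -0.2, -0.05]))  # ~0.89
```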
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- On the Impacts of Contexts on Repository-Level Code Generation [5.641402231731082]
We present RepoExec, a novel benchmark designed to evaluate repository-level code generation. We focus on three key aspects: executability, functional correctness through comprehensive test case generation, and accurate utilization of cross-file contexts.
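Checking executability and test-based functional correctness implies running each generated solution against its tests in isolation; below is a minimal subprocess harness under that assumption (RepoExec's actual harness is not described in the summary).

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(solution: str, test_code: str, timeout: int = 10) -> bool:
    """Write solution plus tests to a temp file and run it in a fresh
    interpreter; any nonzero exit (assertion failure, crash, timeout)
    counts as incorrect. Proper sandboxing is omitted for brevity."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```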
arXiv Detail & Related papers (2024-06-17T10:45:22Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
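A hedged sketch of the recipe: recast the check as token-level prediction by asking the model whether its own answer is correct and reading off the probability of an affirmative token. The prompt wording and the abstracted `yes_token_prob` callable are illustrative.

```python
from typing import Callable

def self_eval_confidence(yes_token_prob: Callable[[str], float],
                         question: str, proposed_answer: str) -> float:
    """Token-level self-evaluation: the model's probability of affirming
    its own answer serves as a confidence score; generation can be
    withheld (selective generation) when it falls below a threshold."""
    prompt = (f"Question: {question}\n"
              f"Proposed answer: {proposed_answer}\n"
              "Is the proposed answer correct? Answer Yes or No: ")
    return yes_token_prob(prompt)
```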
arXiv Detail & Related papers (2023-12-14T19:09:22Z)