CodeMorph: Mitigating Data Leakage in Large Language Model Assessment
- URL: http://arxiv.org/abs/2506.17627v1
- Date: Sat, 21 Jun 2025 08:04:12 GMT
- Title: CodeMorph: Mitigating Data Leakage in Large Language Model Assessment
- Authors: Hongzhou Rao, Yanjie Zhao, Wenjie Zhu, Ling Xiao, Meizhen Wang, Haoyu Wang
- Abstract summary: Concerns about benchmark leakage in large language models for code have raised issues of data contamination and inflated evaluation metrics. We propose CodeMorph, an approach designed to support multiple programming languages while preserving cross-file dependencies to mitigate data leakage.
- Score: 6.27974411661361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Concerns about benchmark leakage in large language models for code (Code LLMs) have raised issues of data contamination and inflated evaluation metrics. The diversity and inaccessibility of many training datasets make it difficult to prevent data leakage entirely, even with time lag strategies. Consequently, generating new datasets through code perturbation has become essential. However, existing methods often fail to produce complex and diverse variations, struggle with complex cross-file dependencies, and lack support for multiple programming languages, which limits their effectiveness in enhancing LLM evaluations for coding tasks. To fill this gap, we propose CodeMorph, an approach designed to support multiple programming languages while preserving cross-file dependencies to mitigate data leakage. CodeMorph consists of two main components that work together to enhance the perturbation process. The first component employs 26 semantic-preserving transformation methods to iteratively perturb code, generating diverse variations while ensuring that the modified code remains compilable. The second component introduces a genetic-algorithm-based selection algorithm, PESO, which identifies the most effective perturbation method at each iteration by targeting lower similarity scores between the perturbed and original code, thereby enhancing overall perturbation effectiveness. Experimental results demonstrate that after applying CodeMorph, the accuracy of LLMs on code completion tasks across five programming languages decreased by an average of 24.67%, with Python showing the most significant reduction at 45%. The similarity score of code optimized by PESO is, on average, 7.01% lower than that of randomly perturbed code, peaking at a reduction of 42.86%.
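To make the two components concrete, here is a minimal, hypothetical Python sketch of the overall idea: a pool of semantic-preserving transformations is applied iteratively, and at each step the candidate least similar to the original code is kept, provided it still parses. The transformation functions, the difflib similarity measure, and the greedy selection loop are illustrative stand-ins; the actual CodeMorph pipeline uses 26 transformations across multiple languages and the genetic-algorithm-based PESO selector rather than this simplified loop.

```python
# Hypothetical sketch of similarity-guided, semantic-preserving code perturbation.
# Not the actual CodeMorph/PESO implementation: the transforms, the difflib
# similarity measure, and the greedy selection loop are illustrative assumptions.
import ast
import difflib


def rename_locals(source: str) -> str:
    """Semantic-preserving transform: rename locally assigned variables
    (sketch only; ignores globals/closures edge cases)."""
    tree = ast.parse(source)
    assigned = {n.id for n in ast.walk(tree)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    mapping = {name: f"v{i}" for i, name in enumerate(sorted(assigned))}

    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            if node.id in mapping:
                node.id = mapping[node.id]
            return node

    return ast.unparse(Renamer().visit(tree))  # ast.unparse needs Python 3.9+


def insert_noop(source: str) -> str:
    """Semantic-preserving transform: prepend a harmless assignment."""
    return "_codemorph_noop = 0\n" + source


TRANSFORMS = [rename_locals, insert_noop]


def similarity(a: str, b: str) -> float:
    """Textual similarity between original and perturbed code (0..1)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def perturb(source: str, iterations: int = 5) -> str:
    """Iteratively apply the transform that lowers similarity the most,
    keeping only candidates that still parse (a stand-in for 'compilable')."""
    current = source
    for _ in range(iterations):
        candidates = []
        for transform in TRANSFORMS:
            try:
                cand = transform(current)
                ast.parse(cand)  # reject transforms that break the code
                candidates.append(cand)
            except SyntaxError:
                continue
        if candidates:
            current = min(candidates, key=lambda c: similarity(source, c))
    return current


if __name__ == "__main__":
    original = "def add(a, b):\n    total = a + b\n    return total"
    print(perturb(original))
```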
Related papers
- LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets [5.191767648600372]
We investigate the utility of Large Language Models for detecting tangled code changes by leveraging both commit messages and method-level code diffs. Our results demonstrate that combining commit messages with code diffs significantly enhances model performance. Applying our approach to 49 open-source projects improves the distributional separability of code metrics between buggy and non-buggy methods.
arXiv Detail & Related papers (2025-05-13T06:26:13Z)
- Program Semantic Inequivalence Game with Large Language Models [10.358176296850639]
Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game, SInQ. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources.
arXiv Detail & Related papers (2025-05-02T20:03:35Z)
- ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding [60.37988508851391]
Language models (LMs) have become a staple of the code-writing toolbox. Research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling syntax from semantics, has been noticeably sparse. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond surface-form syntax and enhance their pre-training sample efficiency.
arXiv Detail & Related papers (2025-03-27T23:08:53Z)
- DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation [20.75363011870647]
DynaCode is a dynamic, complexity-aware benchmark for large language models (LLMs). It evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. Results on 12 recent LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark.
arXiv Detail & Related papers (2025-03-13T15:18:56Z)
- EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning [17.355845751737423]
Large language models (LLMs) play an increasingly important role in code generation. We introduce EffiCoder to improve both aspects by fine-tuning LLMs on a high-quality dataset. EffiCoder offers a scalable and effective solution for advancing AI-driven code generation.
arXiv Detail & Related papers (2024-10-14T07:05:51Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated compared to canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
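As a rough illustration of what such a training-free critique-and-repair loop can look like (not this paper's implementation; the `llm.generate_code` and `llm.critique_and_fix` calls below are hypothetical stand-ins for prompting an LLM):

```python
# Hypothetical sketch of a compiler-feedback self-critique loop; the llm object
# and its methods are assumed placeholders for actual LLM prompting code.
import subprocess
import tempfile


def compile_feedback(source: str) -> str:
    """Byte-compile the candidate and return compiler errors (empty if it compiles)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    result = subprocess.run(
        ["python", "-m", "py_compile", path],
        capture_output=True, text=True,
    )
    return result.stderr


def self_critique_loop(task: str, llm, max_rounds: int = 3) -> str:
    """Generate code, then repeatedly ask the model to critique and fix it
    using the compiler's feedback, stopping once the code compiles."""
    code = llm.generate_code(task)                       # hypothetical LLM call
    for _ in range(max_rounds):
        errors = compile_feedback(code)
        if not errors:
            break
        code = llm.critique_and_fix(task, code, errors)  # hypothetical LLM call
    return code
```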
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on the cleaned data improves performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Contrastive Decoding Improves Reasoning in Large Language Models [55.16503283583076]
We show that Contrastive Decoding achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks.
We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark.
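The summary above does not spell out the mechanism, so here is a reminder of the general contrastive decoding idea (a simplified sketch, not necessarily this paper's exact formulation): each next token is scored by the gap between a stronger "expert" model's log-probability and a weaker "amateur" model's, restricted to tokens the expert itself considers plausible. The toy logit arrays and the alpha threshold below are illustrative values only.

```python
# Simplified contrastive decoding step; logits and alpha are toy assumptions.
import numpy as np


def contrastive_decode_step(expert_logits, amateur_logits, alpha=0.1):
    """Keep only tokens whose expert probability is at least alpha times the
    expert's top probability, then pick the token maximizing
    log p_expert - log p_amateur among the survivors."""
    expert_logprobs = expert_logits - np.logaddexp.reduce(expert_logits)
    amateur_logprobs = amateur_logits - np.logaddexp.reduce(amateur_logits)
    cutoff = np.log(alpha) + expert_logprobs.max()  # adaptive plausibility constraint
    scores = np.where(expert_logprobs >= cutoff,
                      expert_logprobs - amateur_logprobs,
                      -np.inf)
    return int(np.argmax(scores))


# Toy 4-token vocabulary: both models like token 2, but the expert is relatively
# more confident about token 1, so contrastive decoding selects token 1 here.
expert = np.array([1.0, 2.5, 3.0, -1.0])
amateur = np.array([1.0, 0.5, 3.0, -1.0])
print(contrastive_decode_step(expert, amateur))  # -> 1
```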
arXiv Detail & Related papers (2023-09-17T00:29:32Z)
- NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering [52.10214317661547]
Current numerical reasoning methods autoregressively decode program sequences.
The accuracy of program generation drops sharply as the decoding steps unfold due to error propagation.
In this paper, we propose a non-autoregressive program generation framework.
arXiv Detail & Related papers (2022-11-07T11:25:21Z)
- Coding for Distributed Multi-Agent Reinforcement Learning [12.366967700730449]
Stragglers arise frequently in distributed learning systems due to various system disturbances.
We propose a coded distributed learning framework, which speeds up the training of MARL algorithms in the presence of stragglers.
Different coding schemes, including maximum distance separable (MDS) code, random sparse code, replication-based code, and regular low-density parity-check (LDPC) code, are also investigated.
arXiv Detail & Related papers (2021-01-07T00:22:34Z)
- Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose the aggregate interaction modules to integrate the features from adjacent levels.
To obtain more efficient multi-scale features, the self-interaction modules are embedded in each decoder unit.
Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-17T15:41:37Z)