LLM Performance for Code Generation on Noisy Tasks
- URL: http://arxiv.org/abs/2505.23598v1
- Date: Thu, 29 May 2025 16:11:18 GMT
- Title: LLM Performance for Code Generation on Noisy Tasks
- Authors: Radzim Sendyka, Christian Cabrera, Andrei Paleyes, Diana Robinson, Neil Lawrence
- Abstract summary: We show that large language models (LLMs) can solve tasks obfuscated to a level where the text would be unintelligible to human readers. We report empirical evidence of distinct performance decay patterns between contaminated and unseen datasets. We propose measuring the decay of performance under obfuscation as a possible strategy for detecting dataset contamination.
- Score: 0.41942958779358674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the ability of large language models (LLMs) to recognise and solve tasks which have been obfuscated beyond recognition. Focusing on competitive programming and benchmark tasks (LeetCode and MATH), we compare performance across multiple models and obfuscation methods, such as noise and redaction. We demonstrate that all evaluated LLMs can solve tasks obfuscated to a level where the text would be unintelligible to human readers and no longer contains key pieces of instruction or context. We introduce the concept of eager pattern matching to describe this behaviour, which is not observed in tasks published after the models' knowledge cutoff date, indicating strong memorisation or overfitting to training data, rather than legitimate reasoning about the presented problem. We report empirical evidence of distinct performance decay patterns between contaminated and unseen datasets. We discuss the implications for benchmarking and evaluations of model behaviour, arguing for caution when designing experiments using standard datasets. We also propose measuring the decay of performance under obfuscation as a possible strategy for detecting dataset contamination and highlighting potential safety risks and interpretability issues for automated software systems.
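The obfuscation methods and the decay-based contamination check described above might look roughly like the sketch below. This is a minimal illustration, not the authors' code: the noise and redaction functions, the `solve`/`score` placeholders, and the sweep of obfuscation rates are all assumptions.

```python
import random

def add_noise(text: str, rate: float, seed: int = 0) -> str:
    """Randomly replace a fraction of characters with arbitrary letters."""
    rng = random.Random(seed)
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch.isalnum() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def redact(text: str, rate: float, seed: int = 0) -> str:
    """Drop a fraction of words entirely."""
    rng = random.Random(seed)
    return " ".join(w for w in text.split() if rng.random() >= rate)

def decay_curve(tasks, solve, score, obfuscate=add_noise,
                rates=(0.0, 0.2, 0.4, 0.6, 0.8)):
    """Accuracy at increasing obfuscation levels.

    `tasks` are dicts with a 'prompt' field; `solve` stands in for an LLM
    call and `score` checks its answer. A curve that stays flat under heavy
    obfuscation on an old benchmark, but drops steeply on post-cutoff tasks,
    is the kind of contamination signal the paper describes.
    """
    curve = []
    for rate in rates:
        hits = sum(score(t, solve(obfuscate(t["prompt"], rate))) for t in tasks)
        curve.append((rate, hits / len(tasks)))
    return curve
```

Comparing the resulting curves for pre-cutoff and post-cutoff task sets is the detection strategy the abstract proposes.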
Related papers
- Revisiting Pre-trained Language Models for Vulnerability Detection [5.747350434960454]
The rapid advancement of pre-trained language models (PLMs) has demonstrated promising results for various code-related tasks. However, their effectiveness in detecting real-world vulnerabilities remains a critical challenge. This paper introduces RevisitVD, an extensive evaluation of 17 PLMs spanning smaller code-specific PLMs and large-scale PLMs.
arXiv Detail & Related papers (2025-07-22T17:58:49Z)
- Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation [9.434966074326056]
Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination risk masking true generalization. We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks. We demonstrate that fine-tuning on simulated test data (extreme contamination) drastically sharpens task-specific performance but harms overall generalization.
arXiv Detail & Related papers (2025-06-08T15:52:38Z)
- Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination [18.006532081289627]
We propose a novel benchmarking suite for evaluating Code LLMs under potential data contamination. The suite employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations. Results show that it effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.
arXiv Detail & Related papers (2025-03-06T06:56:59Z)
- Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning. We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads. We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
- Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [61.99353167168545]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation. This is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z)
- What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We investigate the vulnerability of large language models (LLMs) to imperceptible attacks, where hidden character manipulation in source code misleads LLMs' behaviour while remaining undetectable to human reviewers. These attacks include coding reordering, invisible coding characters, code deletions, and code homoglyphs. Our findings confirm the susceptibility of LLMs to imperceptible coding character attacks, while different LLMs present different negative correlations between perturbation magnitude and performance.
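As a rough illustration of the character-level manipulations this kind of study examines, the sketch below injects zero-width characters and Unicode homoglyphs into a code snippet; the substitution table and insertion rate are assumptions, not the paper's setup.

```python
import random

# Hypothetical homoglyph table: Latin letters mapped to visually similar
# Cyrillic code points. Rendered text looks unchanged to a human reviewer.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}
ZERO_WIDTH = "\u200b"  # zero-width space, invisible when rendered

def perturb_code(source: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap some characters for homoglyphs and sprinkle in invisible ones."""
    rng = random.Random(seed)
    out = []
    for ch in source:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])      # visually identical, different bytes
        else:
            out.append(ch)
        if ch == " " and rng.random() < rate:
            out.append(ZERO_WIDTH)          # invisible insertion
    return "".join(out)

original = "def connect(host, port):\n    return open_socket(host, port)\n"
attacked = perturb_code(original)
print(original == attacked)   # False: the strings differ at the byte level
```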
arXiv Detail & Related papers (2024-12-11T04:52:41Z)
- Probing Language Models for Pre-training Data Detection [11.37731401086372]
We propose to utilize the probing technique for pre-training data detection by examining the model's internal activations.
Our method is simple and effective and leads to more trustworthy pre-training data detection.
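A probing classifier over internal activations, in the spirit of that summary, might be sketched as below; the mean-pooled layer features, the logistic-regression probe, and the labelled seen/unseen split are illustrative assumptions rather than the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_activation_probe(activations: np.ndarray, seen_labels: np.ndarray):
    """Fit a linear probe that predicts 'was this text in pre-training data?'

    `activations`: one hidden-state vector per text (e.g. mean-pooled from an
    intermediate layer of the target LM), shape (n_texts, hidden_dim).
    `seen_labels`: 1 for texts known to predate the cutoff, 0 for fresh texts.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, seen_labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe, probe.score(X_test, y_test)

# Usage with synthetic vectors, just to show the shapes involved:
rng = np.random.default_rng(0)
fake_acts = rng.normal(size=(200, 768))
fake_labels = rng.integers(0, 2, size=200)
probe, acc = train_activation_probe(fake_acts, fake_labels)
print(f"held-out probe accuracy: {acc:.2f}")
```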
arXiv Detail & Related papers (2024-06-03T13:58:04Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models [27.24738197172374]
Large language models have achieved remarkable performance on various code generation benchmarks.
There have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data.
We show that there is substantial overlap between popular code generation benchmarks and open training corpora, and that models perform significantly better on the subset of benchmarks where similar solutions were seen during training.
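One simple way to approximate that kind of overlap measurement is surface-level n-gram matching between benchmark problems and a training corpus, as in the sketch below; the 13-gram window and the matching rule are illustrative assumptions, not the paper's exact protocol.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Whitespace-tokenised n-grams, a common surface-level overlap unit."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_score(benchmark_item: str, corpus_docs: list[str], n: int = 13) -> float:
    """Fraction of the benchmark item's n-grams that also appear in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Items with a high score are candidates for the 'seen during training' subset.
```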
arXiv Detail & Related papers (2024-03-06T21:45:35Z)
- KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models [53.84677081899392]
KIEval is a Knowledge-grounded Interactive Evaluation framework for large language models.
It incorporates an LLM-powered "interactor" role for the first time to accomplish a dynamic contamination-resilient evaluation.
Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization.
arXiv Detail & Related papers (2024-02-23T01:30:39Z)
- Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and examine the inherent trade-off between privacy and utility.
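A toy version of a similarity-based membership-inference signal, loosely in the spirit of that summary, is sketched below; the ROUGE-like similarity stand-in and the threshold decision are assumptions for illustration only.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap stand-in for a ROUGE-style overlap score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def membership_signal(model_summary: str, reference_summary: str,
                      threshold: float = 0.6) -> bool:
    """Flag a document as a likely training member if the model's summary
    is unusually close to the reference summary it would have seen in training."""
    return similarity(model_summary, reference_summary) >= threshold

# A generated summary that nearly reproduces the reference exceeds the
# threshold and is flagged as a suspected training member.
print(membership_signal(
    "The council approved the new budget on Tuesday.",
    "The council approved the new budget on Tuesday evening.",
))
```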
arXiv Detail & Related papers (2023-10-20T05:44:39Z)
- Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models.
Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data.
Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
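The kinds of single character-level perturbations such a robustness dataset might contain can be mimicked with a few string transforms, as in the sketch below; the specific perturbation set (typos, character swaps, casing noise) is an assumption and not Noise-LLM's actual taxonomy.

```python
import random

def typo(text: str, rng: random.Random) -> str:
    """Replace one random character with an arbitrary letter."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + rng.choice("qwertyuiopasdfghjklzxcvbnm") + text[i + 1:]

def swap(text: str, rng: random.Random) -> str:
    """Transpose two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def casing(text: str, rng: random.Random) -> str:
    """Randomly flip the case of a few characters."""
    return "".join(c.swapcase() if rng.random() < 0.2 else c for c in text)

def perturb(utterance: str, kinds=("typo", "swap", "casing"), seed: int = 0) -> str:
    """Apply one randomly chosen perturbation to a slot-filling utterance."""
    rng = random.Random(seed)
    fn = {"typo": typo, "swap": swap, "casing": casing}[rng.choice(list(kinds))]
    return fn(utterance, rng)

print(perturb("book a flight from london to paris tomorrow"))
```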
arXiv Detail & Related papers (2023-10-10T10:22:05Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)