ReCatcher: Towards LLMs Regression Testing for Code Generation
- URL: http://arxiv.org/abs/2507.19390v1
- Date: Fri, 25 Jul 2025 15:45:55 GMT
- Title: ReCatcher: Towards LLMs Regression Testing for Code Generation
- Authors: Altaf Allah Abbassi, Leuson Da Silva, Amin Nikanjam, Foutse Khomh
- Abstract summary: ReCatcher is a regression testing framework for Python code generation. We apply ReCatcher to assess regressions across three update scenarios: fine-tuning, merging, and model release. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%.
- Score: 11.185300073739098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios (fine-tuning, merging, and model release) using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces regressions of up to 50% in handling missing imports compared to GPT-3.5-turbo, while GPT-4o-mini suffers up to 80% performance degradation in execution time versus GPT-4o. Overall, logical correctness, performance, and error handling (e.g., syntax errors and missing imports) are the most regression-prone areas. Compared with baseline solutions, ReCatcher achieves better and more consistent accuracy across logical and performance aspects. ReCatcher highlights the importance of systematic regression evaluation before adopting new models, while assisting researchers and practitioners in making more informed update decisions.
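The framework's actual implementation is not reproduced in this digest, but the comparison loop the abstract describes is easy to sketch. Below is a minimal, illustrative Python version; the model-wrapper callables, helper names, and the 1.5x slowdown threshold are assumptions, not ReCatcher's real API:

```python
import ast
import subprocess
import time

def parses(code: str) -> bool:
    """Static quality check: does the generated code even parse?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def passes_tests(code: str, tests: str) -> bool:
    """Logical correctness: run the code plus its test suite in a subprocess."""
    try:
        proc = subprocess.run(["python", "-c", code + "\n" + tests],
                              capture_output=True, timeout=10)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def runtime(code: str) -> float:
    """Execution performance: wall-clock runtime (inf on timeout)."""
    start = time.perf_counter()
    try:
        subprocess.run(["python", "-c", code], capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return float("inf")
    return time.perf_counter() - start

def compare(current_gen, candidate_gen, benchmark):
    """current_gen / candidate_gen: callables mapping a prompt to Python code
    (hypothetical wrappers around the two LLMs under comparison).
    A regression is anything the current model gets right that the
    candidate update gets wrong, or makes markedly slower."""
    regressions = {"syntax": 0, "correctness": 0, "performance": 0}
    for task in benchmark:  # each task: {"prompt": ..., "tests": ...}
        old = current_gen(task["prompt"])
        new = candidate_gen(task["prompt"])
        if parses(old) and not parses(new):
            regressions["syntax"] += 1
        if passes_tests(old, task["tests"]) and not passes_tests(new, task["tests"]):
            regressions["correctness"] += 1
        elif runtime(new) > 1.5 * runtime(old):  # illustrative threshold
            regressions["performance"] += 1
    return regressions
```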
Related papers
- Ensuring Reproducibility in Generative AI Systems for General Use Cases: A Framework for Regression Testing and Open Datasets [0.0]
We introduce GPR-bench, a benchmark that operationalizes regression testing for general-purpose use cases. We show that newer models generally improve correctness, but the differences are modest and not statistically significant. In contrast, the concise-writing instruction significantly enhances conciseness, demonstrating the effectiveness of prompt engineering.
arXiv Detail & Related papers (2025-05-02T12:31:43Z)
- Sparse Regression for Machine Translation [0.0]
We show the effectiveness of transductive regression techniques in learning mappings between the source and target features of given parallel corpora.
We present encouraging results when translating from German to English and Spanish to English.
arXiv Detail & Related papers (2024-06-27T18:43:51Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose RAP-Gen, a novel Retrieval-Augmented Patch Generation framework.
RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript and the Code Refinement and Defects4J benchmarks in Java.
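The retrieve-then-generate pattern this summary describes can be sketched in a few lines. RAP-Gen itself pairs a learned hybrid retriever with CodeT5; the difflib-based similarity and helper names below are illustrative stand-ins, not the paper's components:

```python
from difflib import SequenceMatcher

def retrieve_fix_patterns(buggy_code, bug_fix_pairs, k=3):
    """Rank previous (buggy, fixed) pairs by similarity to the new buggy
    snippet. Plain lexical similarity stands in for RAP-Gen's learned
    retriever here."""
    return sorted(
        bug_fix_pairs,
        key=lambda pair: SequenceMatcher(None, buggy_code, pair[0]).ratio(),
        reverse=True,
    )[:k]

def build_repair_prompt(buggy_code, retrieved):
    """Prepend the retrieved fix patterns to the input so the generator
    can imitate relevant repairs (the retrieval-augmented part)."""
    examples = "\n\n".join(f"BUGGY:\n{b}\nFIXED:\n{f}" for b, f in retrieved)
    return f"{examples}\n\nBUGGY:\n{buggy_code}\nFIXED:\n"
```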
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for approximating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
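The idea behind column-row sampling is to estimate a matrix product from a few sampled column/row pairs; WTA-CRS replaces pure sampling with a winner-take-all selection rule to cut variance further. A sketch of the classic unbiased estimator the method builds on (not the paper's WTA variant) follows:

```python
import numpy as np

def crs_matmul(A, B, k, seed=0):
    """Unbiased column-row sampling estimate of A @ B.
    Sample k column-of-A / row-of-B pairs with probability proportional
    to the product of their norms, then rescale each sampled outer
    product so the estimator's expectation equals the exact product."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=k, p=p)
    # E[(1/k) * sum_t A[:, i_t] B[i_t, :] / p_{i_t}] = A @ B
    return (A[:, idx] / (k * p[idx])) @ B[idx, :]
```

Larger k trades memory and compute for lower variance; exact recovery requires k equal to the inner dimension.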
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- QLoRA: Efficient Finetuning of Quantized LLMs [66.58009990713134]
We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU.
QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low-Rank Adapters (LoRA).
Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark.
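This recipe is commonly reproduced with the Hugging Face transformers, peft, and bitsandbytes stack; a sketch under that assumption follows. The base model name and LoRA hyperparameters are illustrative, not the paper's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the pretrained model in 4-bit NF4; the quantized weights stay frozen.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # illustrative base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Gradients flow through the frozen 4-bit base into small trainable
# low-rank adapters attached to the attention projections.
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train
```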
arXiv Detail & Related papers (2023-05-23T17:50:33Z)
- Learning Label Encodings for Deep Regression [10.02230163797581]
Deep regression networks are widely used to tackle the problem of predicting a continuous value for a given input.
The space of label encodings for regression is large.
This paper introduces Regularized Label Encoding Learning (RLEL) for end-to-end training of an entire network and its label encoding.
arXiv Detail & Related papers (2023-03-04T00:11:34Z)
- Learning to Learn to Predict Performance Regressions in Production at Meta [11.45540873578889]
This article gives an account of the experiences we gained when researching and deploying an ML-based regression prediction pipeline at Meta.
Our investigation shows the inherent difficulty of the performance prediction problem, which is characterized by a large imbalance of benign versus regression-inducing changes.
Our results also call into question the general applicability of Transformer-based architectures for performance prediction.
arXiv Detail & Related papers (2022-08-08T18:16:51Z)
- Stochastic Gradient Descent without Full Data Shuffle [65.97105896033815]
CorgiPile is a hierarchical data-shuffling strategy that avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle.
Our results show that CorgiPile can achieve a convergence rate comparable to full-shuffle SGD for both deep learning and generalized linear models.
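The two-level shuffle is straightforward to sketch: shuffle the order of data blocks, then shuffle tuples inside a small buffer of blocks, so no step requires randomly accessing the whole dataset. A minimal in-memory illustration (parameter names are ours, not the paper's):

```python
import random

def block_buffer_shuffle(dataset, block_size=1024, buffer_blocks=8, seed=0):
    """CorgiPile-style two-level shuffle: level 1 permutes block order,
    level 2 shuffles tuples within a bounded buffer of blocks."""
    rng = random.Random(seed)
    blocks = [dataset[i:i + block_size]
              for i in range(0, len(dataset), block_size)]
    rng.shuffle(blocks)  # level 1: block-level shuffle
    for start in range(0, len(blocks), buffer_blocks):
        buffer = [x for blk in blocks[start:start + buffer_blocks] for x in blk]
        rng.shuffle(buffer)  # level 2: tuple-level shuffle within the buffer
        yield from buffer
```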
arXiv Detail & Related papers (2022-06-12T20:04:31Z)
- Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates [68.09049111171862]
This work focuses on quantifying, reducing, and analyzing regression errors in NLP model updates.
We formulate regression-free model updates as a constrained optimization problem.
We empirically analyze how model ensembling reduces regressions.
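The paper's exact objective is not quoted in this summary, but a regression-free update is naturally posed as a constrained problem of roughly the following shape (notation here is illustrative): minimize the new model's task loss while capping negative flips, i.e. examples the old model classified correctly that the new model gets wrong.

```latex
% f_old: deployed model, f_new(.; theta): candidate update,
% D: evaluation set, epsilon: allowed regression budget.
\min_{\theta} \; \mathcal{L}\bigl(f_{\mathrm{new}}(\cdot;\theta)\bigr)
\quad \text{s.t.} \quad
\frac{1}{|D|} \sum_{(x,y) \in D}
\mathbb{1}\bigl[\, f_{\mathrm{old}}(x) = y \;\wedge\; f_{\mathrm{new}}(x;\theta) \neq y \,\bigr]
\;\le\; \epsilon
```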
arXiv Detail & Related papers (2021-05-07T03:33:00Z)
- RepPoints V2: Verification Meets Regression for Object Detection [65.120827759348]
We introduce verification tasks into the localization prediction of RepPoints.
RepPoints v2 provides consistent improvements of about 2.0 mAP over the original RepPoints.
We show that the proposed approach can more generally elevate other object detection frameworks as well as applications such as instance segmentation.
arXiv Detail & Related papers (2020-07-16T17:57:08Z)