Decoding Data Quality via Synthetic Corruptions: Embedding-guided
Pruning of Code Data
- URL: http://arxiv.org/abs/2312.02418v1
- Date: Tue, 5 Dec 2023 01:19:30 GMT
- Title: Decoding Data Quality via Synthetic Corruptions: Embedding-guided
Pruning of Code Data
- Authors: Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal
Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos,
Newsha Ardalani
- Abstract summary: This work focuses on using embeddings to identify and remove "low-quality" code data.
First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions.
We devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset.
- Score: 22.461461600306688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code datasets, often collected from diverse and uncontrolled sources such as
GitHub, potentially suffer from quality issues, thereby affecting the
performance and training efficiency of Large Language Models (LLMs) optimized
for code generation. Previous studies demonstrated the benefit of using
embedding spaces for data pruning, but they mainly focused on duplicate removal
or increasing variety, and in other modalities, such as images. Our work
focuses on using embeddings to identify and remove "low-quality" code data.
First, we explore features of "low-quality" code in embedding space, through
the use of synthetic corruptions. Armed with this knowledge, we devise novel
pruning metrics that operate in embedding space to identify and remove
low-quality entries in the Stack dataset. We demonstrate the benefits of this
synthetic corruption informed pruning (SCIP) approach on the well-established
HumanEval and MBPP benchmarks, outperforming existing embedding-based methods.
Importantly, we achieve up to a 3% performance improvement over no pruning,
thereby showing the promise of insights from synthetic corruptions for data
pruning.
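To make the pruning idea concrete, here is a minimal sketch of embedding-guided pruning against a synthetic-corruption direction. The embedder (`all-MiniLM-L6-v2`), the `corrupt` callable, and the single centroid-based score are illustrative assumptions; the paper's actual corruption operators and pruning metrics may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any code embedder works

def scip_style_prune(code_samples, corrupt, keep_fraction=0.9):
    """Prune the samples that look most like synthetically corrupted code in
    embedding space. `corrupt` is a hypothetical function that injects errors
    (e.g., deleting lines or scrambling identifiers)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder
    clean = model.encode(code_samples, normalize_embeddings=True)
    bad = model.encode([corrupt(c) for c in code_samples],
                       normalize_embeddings=True)
    centroid = bad.mean(axis=0)
    centroid /= np.linalg.norm(centroid)             # "low-quality" direction
    scores = clean @ centroid                        # cosine sim to corruption centroid
    cutoff = np.quantile(scores, keep_fraction)      # drop the top (1 - keep_fraction)
    return [c for c, s in zip(code_samples, scores) if s <= cutoff]
```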
Related papers
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
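A simplified sketch of the unit-test-guided validation step UnitCoder describes: run model-generated tests against a candidate and keep only passing samples. The single-file, unsandboxed execution here is an assumption for brevity; the actual pipeline is iterative and more careful.

```python
import os
import subprocess
import sys
import tempfile

def passes_unit_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Execute candidate code together with its generated unit tests in a
    subprocess; a zero exit code means the tests passed."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hanging code counts as a failure
    finally:
        os.unlink(path)
```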
- CoRNStack: High-Quality Contrastive Data for Better Code Ranking [45.18877655831977]
We introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages.
This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives.
We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks.
arXiv Detail & Related papers (2024-12-01T23:54:12Z)
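A rough sketch of the two curation steps the summary mentions, on pre-computed, L2-normalized embeddings: a consistency filter that drops positives falling outside the query's top-k corpus neighbours, and similarity-ranked candidates for hard-negative mining. The thresholds are illustrative, not CoRNStack's exact settings.

```python
import numpy as np

def consistency_filter(query_emb, pos_emb, corpus_emb, top_k=32, n_negs=5):
    """Return a keep-mask for (query, positive) pairs plus hard-negative
    candidate indices. Embeddings are assumed L2-normalized."""
    sims = query_emb @ corpus_emb.T                  # (n_queries, corpus_size)
    pos_sims = np.sum(query_emb * pos_emb, axis=1)   # similarity to own positive
    rank = (sims > pos_sims[:, None]).sum(axis=1)    # corpus docs that beat it
    keep = rank < top_k                              # drop noisy positives
    order = np.argsort(-sims, axis=1)
    hard_negs = order[:, 1:1 + n_negs]               # nearest neighbours after the best hit;
    return keep, hard_negs                           # in practice, exclude the true positive
```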
- Synth4Seg -- Learning Defect Data Synthesis for Defect Segmentation using Bi-level Optimization [37.775768735361275]
We propose a bi-level optimization-based synthetic defect data generation framework.
We show that the proposed method can be used for learning the most effective locations for pasting synthetic defects.
We also demonstrate up to 2.6% performance gain by learning the importance weights for different augmentation-specific defect data sources.
arXiv Detail & Related papers (2024-10-24T07:25:12Z)
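Synth4Seg's bi-level scheme is hard to compress; the toy below keeps only the outer idea of learnable importance weights over defect-data sources, updated to favour sources with low validation loss. The fixed `val_losses` vector is a stand-in for measurements from a real training loop.

```python
import torch

n_sources = 4
log_w = torch.zeros(n_sources, requires_grad=True)   # importance logits
opt = torch.optim.Adam([log_w], lr=0.05)

for _ in range(100):
    val_losses = torch.tensor([0.9, 0.4, 0.7, 0.2])  # placeholder measurements
    weights = torch.softmax(log_w, dim=0)
    loss = (weights * val_losses).sum()              # favour helpful sources
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(log_w, dim=0))  # mass shifts toward low-loss sources
```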
- Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning [4.975728472540823]
We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code.
Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality of code generation.
arXiv Detail & Related papers (2024-07-06T10:30:43Z)
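One simple instance of the clustering-plus-pruning recipe: cluster sample embeddings with k-means and keep the points closest to each centroid. The paper combines several clustering and pruning metrics; this shows only the most basic variant.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune(embeddings, n_clusters=50, keep_per_cluster=100, seed=0):
    """Keep the `keep_per_cluster` samples nearest to each k-means centroid."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        keep.extend(members[np.argsort(dists[members])[:keep_per_cluster]])
    return sorted(keep)
```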
- Brevity is the soul of wit: Pruning long files for code generation [19.61423412870527]
We find that a simple heuristic, pruning long files, outperforms other methods in compute-limited regimes.
Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval.
arXiv Detail & Related papers (2024-06-29T13:08:24Z)
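The heuristic in its simplest form, assuming a character-length threshold (the paper's exact cutoff and unit may differ):

```python
def prune_long_files(files, max_chars=20_000):
    """Drop the longest files from a code corpus."""
    return [f for f in files if len(f) <= max_chars]
```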
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable improves the code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
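A hedged sketch of one cleaning step in the spirit of that pipeline; `ask_llm` is a hypothetical completion function, not a specific API, and the prompt is illustrative rather than the paper's.

```python
CLEANING_PROMPT = """Rewrite the following program so it is well-structured and
readable: use descriptive names, factor repeated logic into helper functions,
and add brief comments. Preserve the exact input/output behavior.

{code}
"""

def clean_program(code: str, ask_llm) -> str:
    """Ask an LLM to restructure a program; fall back to the original on an
    empty response."""
    cleaned = ask_llm(CLEANING_PROMPT.format(code=code))
    return cleaned if cleaned and cleaned.strip() else code
```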
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
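A generic sketch of discriminator-guided sample reweighting (the paper's exact recalibration rule may differ): the discriminator's clean-vs-noisy logits become weights on the main model's per-sample losses.

```python
import torch

def recalibrated_weights(disc_logits: torch.Tensor) -> torch.Tensor:
    """Map discriminator logits (higher = more likely clean) to mean-1 sample
    weights for the main NER loss; detached so only the NER model is updated."""
    p_clean = torch.sigmoid(disc_logits)
    return (p_clean / p_clean.mean()).detach()

# usage: loss = (recalibrated_weights(d_logits) * per_sample_losses).mean()
```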
- Rethinking Negative Pairs in Code Search [56.23857828689406]
We propose a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE.
We analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation.
arXiv Detail & Related papers (2023-10-12T06:32:42Z)
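The core of the loss is easy to state: InfoNCE with per-negative weight terms in the denominator. The sketch below assumes a single query with stacked negatives and given weights; how the weights are estimated is model-specific and omitted.

```python
import torch
import torch.nn.functional as F

def soft_info_nce(query, positive, negatives, neg_weights, tau=0.07):
    """query/positive: (d,); negatives: (n, d); neg_weights: (n,)."""
    q = F.normalize(query, dim=0)
    p = F.normalize(positive, dim=0)
    n = F.normalize(negatives, dim=1)
    pos = torch.exp(q @ p / tau)
    neg = (neg_weights * torch.exp(n @ q / tau)).sum()  # weighted negatives
    return -torch.log(pos / (pos + neg))
```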
- Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis [2.9398911304923447]
This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans.
We devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets.
To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code.
arXiv Detail & Related papers (2023-06-26T03:15:06Z)
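A toy illustration of the kind of discernable stylistic features such a classifier can build on; the feature set here is an assumption, not the paper's.

```python
def style_features(code: str) -> dict:
    """Cheap surface features that often differ between human- and
    LLM-authored code (illustrative only)."""
    lines = code.splitlines() or [""]
    return {
        "avg_line_len": sum(map(len, lines)) / len(lines),
        "comment_ratio": sum(l.lstrip().startswith("#") for l in lines) / len(lines),
        "blank_ratio": sum(not l.strip() for l in lines) / len(lines),
        "max_indent": max(len(l) - len(l.lstrip(" ")) for l in lines),
    }
```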
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
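A toy version of the latent-code idea: condense a dataset into a few learnable codes plus a shared decoder. The moment-matching objective below is a deliberate simplification of the paper's condensation loss.

```python
import torch
import torch.nn as nn

n_codes, code_dim, data_dim = 10, 16, 784
codes = nn.Parameter(torch.randn(n_codes, code_dim))     # learnable codes
decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                        nn.Linear(128, data_dim))        # shared generator
opt = torch.optim.Adam([codes, *decoder.parameters()], lr=1e-3)

real_batch = torch.randn(64, data_dim)                   # stand-in for real data
for _ in range(200):
    synth = decoder(codes)
    loss = ((synth.mean(0) - real_batch.mean(0)) ** 2).mean() \
         + ((synth.std(0) - real_batch.std(0)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```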
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph, updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
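A minimal building block behind such methods: an auto-encoder with a sign-binarized bottleneck, trained via a straight-through estimator. The twin bottlenecks and the code-driven graph update are omitted here for brevity.

```python
import torch
import torch.nn as nn

class BinaryBottleneckAE(nn.Module):
    def __init__(self, in_dim=784, code_bits=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, code_bits)
        self.dec = nn.Linear(code_bits, in_dim)

    def forward(self, x):
        z = torch.tanh(self.enc(x))
        b = z + (torch.sign(z) - z).detach()  # straight-through binarization
        return self.dec(b), b                 # reconstruction + binary code
```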
This list is automatically generated from the titles and abstracts of the papers on this site.