Decoding Data Quality via Synthetic Corruptions: Embedding-guided
Pruning of Code Data
- URL: http://arxiv.org/abs/2312.02418v1
- Date: Tue, 5 Dec 2023 01:19:30 GMT
- Title: Decoding Data Quality via Synthetic Corruptions: Embedding-guided
Pruning of Code Data
- Authors: Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal
Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos,
Newsha Ardalani
- Abstract summary: This work focuses on using embeddings to identify and remove "low-quality" code data.
First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions.
We devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset.
- Score: 22.461461600306688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code datasets, often collected from diverse and uncontrolled sources such as
GitHub, potentially suffer from quality issues, thereby affecting the
performance and training efficiency of Large Language Models (LLMs) optimized
for code generation. Previous studies demonstrated the benefit of using
embedding spaces for data pruning, but they mainly focused on duplicate removal
or increasing variety, and in other modalities, such as images. Our work
focuses on using embeddings to identify and remove "low-quality" code data.
First, we explore features of "low-quality" code in embedding space, through
the use of synthetic corruptions. Armed with this knowledge, we devise novel
pruning metrics that operate in embedding space to identify and remove
low-quality entries in the Stack dataset. We demonstrate the benefits of this
synthetic corruption informed pruning (SCIP) approach on the well-established
HumanEval and MBPP benchmarks, outperforming existing embedding-based methods.
Importantly, we achieve up to a 3% performance improvement over no pruning,
thereby showing the promise of insights from synthetic corruptions for data
pruning.
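The abstract describes the SCIP recipe only at a high level: synthetically corrupt known-good code, observe where the corruptions land in embedding space, and prune real data that sits nearby. The snippet below is a minimal, hypothetical sketch of that idea, not the paper's actual metrics; the corruption operators, the use of k-means centroids over corrupted embeddings as "low-quality" reference points, the embedding function, and the keep fraction are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def synthetic_corruptions(code: str) -> list[str]:
    """Toy corruption operators (illustrative only): break syntax, drop content."""
    lines = code.splitlines()
    syntax_broken = code.replace(")", "").replace("}", "")         # syntax-style corruption
    content_dropped = "\n".join(lines[: max(1, len(lines) // 2)])  # content-style corruption
    return [syntax_broken, content_dropped]

def prune_by_corruption_proximity(codes, embed, keep_fraction=0.8, n_clusters=8, seed=0):
    """Score each code sample by its embedding-space distance to synthetically
    corrupted code, then drop the samples that sit closest to the corruptions.
    `embed` maps a list of strings to an (n, d) numpy array of embeddings."""
    clean_emb = embed(codes)
    corrupted = [c for code in codes for c in synthetic_corruptions(code)]
    corrupt_emb = embed(corrupted)

    # Summarize where corrupted code lives in embedding space with k-means centroids.
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(corrupt_emb)
    centroids = km.cluster_centers_

    # Pruning metric: distance to the nearest "corruption" centroid.
    # Smaller distance = more corruption-like = assumed lower quality.
    dists = np.linalg.norm(clean_emb[:, None, :] - centroids[None, :, :], axis=-1).min(axis=1)
    keep = np.argsort(-dists)[: int(keep_fraction * len(codes))]
    return [codes[i] for i in sorted(keep)]
```

Any sentence-level embedding model could stand in for `embed` here; the point is only that the pruning decision is made from distances in embedding space rather than from surface heuristics.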
Related papers
- Synth4Seg -- Learning Defect Data Synthesis for Defect Segmentation using Bi-level Optimization [37.775768735361275]
We propose a bi-level optimization-based synthetic defect data generation framework.
We show that the proposed method can be used for learning the most effective locations for pasting synthetic defects.
We also demonstrate up to 2.6% performance gain by learning the importance weights for different augmentation-specific defect data sources.
arXiv Detail & Related papers (2024-10-24T07:25:12Z)
- Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning [4.975728472540823]
We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code.
Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality of the generated code.
arXiv Detail & Related papers (2024-07-06T10:30:43Z)
- Brevity is the soul of wit: Pruning long files for code generation [19.61423412870527]
We find that a simple heuristic, pruning long files, outperforms other methods in compute-limited regimes (a minimal sketch of this filter appears after the related-papers list).
Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval.
arXiv Detail & Related papers (2024-06-29T13:08:24Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable improves code generation performance.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
- Leveraging Generative AI: Improving Software Metadata Classification with Generated Code-Comment Pairs [0.0]
In software development, code comments play a crucial role in enhancing code comprehension and collaboration.
This research paper addresses the challenge of objectively classifying code comments as "Useful" or "Not Useful".
We propose a novel solution that harnesses contextualized embeddings, particularly BERT, to automate this classification process.
arXiv Detail & Related papers (2023-10-14T12:09:43Z)
- Rethinking Negative Pairs in Code Search [56.23857828689406]
We propose a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE.
We analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation.
arXiv Detail & Related papers (2023-10-12T06:32:42Z)
- Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis [2.9398911304923447]
This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans.
We devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets.
To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code.
arXiv Detail & Related papers (2023-06-26T03:15:06Z)
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving the dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
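The "Brevity is the soul of wit" entry above reduces to a very concrete heuristic: measure file length and drop the longest files before training. As a purely illustrative sketch (the percentile cutoff and the character-based length measure are assumptions, not that paper's exact setup), such a filter could look like:

```python
import numpy as np

def prune_long_files(files: list[str], keep_fraction: float = 0.9) -> list[str]:
    """Keep the shortest `keep_fraction` of the corpus and drop the longest files.
    Length is measured in characters here; token counts would work the same way."""
    lengths = np.array([len(f) for f in files])
    cutoff = np.quantile(lengths, keep_fraction)  # files longer than this are pruned
    return [f for f, n in zip(files, lengths) if n <= cutoff]

# Toy usage: the one very long file is removed, the two short snippets are kept.
corpus = ["print('hi')", "def double(x):\n    return x * 2", "x = 1\n" * 500]
print(len(prune_long_files(corpus, keep_fraction=0.9)))  # -> 2
```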
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.