Decoding Data Quality via Synthetic Corruptions: Embedding-guided
Pruning of Code Data
- URL: http://arxiv.org/abs/2312.02418v1
- Date: Tue, 5 Dec 2023 01:19:30 GMT
- Title: Decoding Data Quality via Synthetic Corruptions: Embedding-guided
Pruning of Code Data
- Authors: Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal
Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos,
Newsha Ardalani
- Abstract summary: This work focuses on using embeddings to identify and remove "low-quality" code data.
First, we explore features of "low-quality" code in embedding space, through the use of synthetic corruptions.
We devise novel pruning metrics that operate in embedding space to identify and remove low-quality entries in the Stack dataset.
- Score: 22.461461600306688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code datasets, often collected from diverse and uncontrolled sources such as
GitHub, potentially suffer from quality issues, thereby affecting the
performance and training efficiency of Large Language Models (LLMs) optimized
for code generation. Previous studies demonstrated the benefit of using
embedding spaces for data pruning, but they mainly focused on duplicate removal
or increasing variety, and in other modalities, such as images. Our work
focuses on using embeddings to identify and remove "low-quality" code data.
First, we explore features of "low-quality" code in embedding space, through
the use of synthetic corruptions. Armed with this knowledge, we devise novel
pruning metrics that operate in embedding space to identify and remove
low-quality entries in the Stack dataset. We demonstrate the benefits of this
synthetic corruption informed pruning (SCIP) approach on the well-established
HumanEval and MBPP benchmarks, outperforming existing embedding-based methods.
Importantly, we achieve up to a 3% performance improvement over no pruning,
thereby showing the promise of insights from synthetic corruptions for data
pruning.
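To make the pruning idea concrete, here is a minimal sketch of embedding-guided pruning against a synthetic-corruption direction. The embedder (`all-MiniLM-L6-v2`), the `corrupt` callable, and the single centroid-based score are illustrative assumptions; the paper's actual corruption operators and pruning metrics may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # any code embedder works

def scip_style_prune(code_samples, corrupt, keep_fraction=0.9):
    """Prune the samples that look most like synthetically corrupted code in
    embedding space. `corrupt` is a hypothetical function that injects errors
    (e.g., deleting lines or scrambling identifiers)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder
    clean = model.encode(code_samples, normalize_embeddings=True)
    bad = model.encode([corrupt(c) for c in code_samples],
                       normalize_embeddings=True)
    centroid = bad.mean(axis=0)
    centroid /= np.linalg.norm(centroid)             # "low-quality" direction
    scores = clean @ centroid                        # cosine sim to corruption centroid
    cutoff = np.quantile(scores, keep_fraction)      # drop the top (1 - keep_fraction)
    return [c for c, s in zip(code_samples, scores) if s <= cutoff]
```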
Related papers
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
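A simplified sketch of the unit-test-guided validation step UnitCoder describes: run model-generated tests against a candidate and keep only passing samples. The single-file, unsandboxed execution here is an assumption for brevity; the actual pipeline is iterative and more careful.

```python
import os
import subprocess
import sys
import tempfile

def passes_unit_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Execute candidate code together with its generated unit tests in a
    subprocess; a zero exit code means the tests passed."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hanging code counts as a failure
    finally:
        os.unlink(path)
```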
- CoRNStack: High-Quality Contrastive Data for Better Code Ranking [45.18877655831977]
We introduce CoRNStack, a large-scale, high-quality contrastive training dataset for code that spans multiple programming languages.
This dataset is curated using consistency filtering to eliminate noisy positives and is further enriched with mined hard negatives.
We demonstrate that contrastive training of embedding models using CoRNStack leads to state-of-the-art performance across a variety of code retrieval tasks.
arXiv Detail & Related papers (2024-12-01T23:54:12Z)
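A rough sketch of the two curation steps the summary mentions, on pre-computed, L2-normalized embeddings: a consistency filter that drops positives falling outside the query's top-k corpus neighbours, and similarity-ranked candidates for hard-negative mining. The thresholds are illustrative, not CoRNStack's exact settings.

```python
import numpy as np

def consistency_filter(query_emb, pos_emb, corpus_emb, top_k=32, n_negs=5):
    """Return a keep-mask for (query, positive) pairs plus hard-negative
    candidate indices. Embeddings are assumed L2-normalized."""
    sims = query_emb @ corpus_emb.T                  # (n_queries, corpus_size)
    pos_sims = np.sum(query_emb * pos_emb, axis=1)   # similarity to own positive
    rank = (sims > pos_sims[:, None]).sum(axis=1)    # corpus docs that beat it
    keep = rank < top_k                              # drop noisy positives
    order = np.argsort(-sims, axis=1)
    hard_negs = order[:, 1:1 + n_negs]               # nearest neighbours after the best hit;
    return keep, hard_negs                           # in practice, exclude the true positive
```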
- Synth4Seg -- Learning Defect Data Synthesis for Defect Segmentation using Bi-level Optimization [37.775768735361275]
We propose a bi-level optimization-based synthetic defect data generation framework.
We show that the proposed method can be used for learning the most effective locations for pasting synthetic defects.
We also demonstrate up to 2.6% performance gain by learning the importance weights for different augmentation-specific defect data sources.
arXiv Detail & Related papers (2024-10-24T07:25:12Z)
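Synth4Seg's bi-level scheme is hard to compress; the toy below keeps only the outer idea of learnable importance weights over defect-data sources, updated to favour sources with low validation loss. The fixed `val_losses` vector is a stand-in for measurements from a real training loop.

```python
import torch

n_sources = 4
log_w = torch.zeros(n_sources, requires_grad=True)   # importance logits
opt = torch.optim.Adam([log_w], lr=0.05)

for _ in range(100):
    val_losses = torch.tensor([0.9, 0.4, 0.7, 0.2])  # placeholder measurements
    weights = torch.softmax(log_w, dim=0)
    loss = (weights * val_losses).sum()              # favour helpful sources
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.softmax(log_w, dim=0))  # mass shifts toward low-loss sources
```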
- Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning [4.975728472540823]
We present techniques that integrate various clustering and pruning metrics to selectively reduce training data without compromising the accuracy and functionality of the generated code.
Our experiments show that these pruning strategies not only reduce the computational resources needed but also enhance the overall quality of code generation.
arXiv Detail & Related papers (2024-07-06T10:30:43Z)
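One simple instance of the clustering-plus-pruning recipe: cluster sample embeddings with k-means and keep the points closest to each centroid. The paper combines several clustering and pruning metrics; this shows only the most basic variant.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune(embeddings, n_clusters=50, keep_per_cluster=100, seed=0):
    """Keep the `keep_per_cluster` samples nearest to each k-means centroid."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        keep.extend(members[np.argsort(dists[members])[:keep_per_cluster]])
    return sorted(keep)
```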
- Brevity is the soul of wit: Pruning long files for code generation [19.61423412870527]
We find that a simple heuristic, pruning long files, outperforms other methods in compute-limited regimes.
Our method can yield up to a 2x efficiency benefit in training (while matching performance) or a 3.5% absolute performance improvement on HumanEval.
arXiv Detail & Related papers (2024-06-29T13:08:24Z)
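The heuristic in its simplest form, assuming a character-length threshold (the paper's exact cutoff and unit may differ):

```python
def prune_long_files(files, max_chars=20_000):
    """Drop the longest files from a code corpus."""
    return [f for f in files if len(f) <= max_chars]
```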
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable improves the code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
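A hedged sketch of one cleaning step in the spirit of that pipeline; `ask_llm` is a hypothetical completion function, not a specific API, and the prompt is illustrative rather than the paper's.

```python
CLEANING_PROMPT = """Rewrite the following program so it is well-structured and
readable: use descriptive names, factor repeated logic into helper functions,
and add brief comments. Preserve the exact input/output behavior.

{code}
"""

def clean_program(code: str, ask_llm) -> str:
    """Ask an LLM to restructure a program; fall back to the original on an
    empty response."""
    cleaned = ask_llm(CLEANING_PROMPT.format(code=code))
    return cleaned if cleaned and cleaned.strip() else code
```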
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
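A generic sketch of discriminator-guided sample reweighting (the paper's exact recalibration rule may differ): the discriminator's clean-vs-noisy logits become weights on the main model's per-sample losses.

```python
import torch

def recalibrated_weights(disc_logits: torch.Tensor) -> torch.Tensor:
    """Map discriminator logits (higher = more likely clean) to mean-1 sample
    weights for the main NER loss; detached so only the NER model is updated."""
    p_clean = torch.sigmoid(disc_logits)
    return (p_clean / p_clean.mean()).detach()

# usage: loss = (recalibrated_weights(d_logits) * per_sample_losses).mean()
```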
- Rethinking Negative Pairs in Code Search [56.23857828689406]
We propose a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE.
We analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation.
arXiv Detail & Related papers (2023-10-12T06:32:42Z)
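The core of the loss is easy to state: InfoNCE with per-negative weight terms in the denominator. The sketch below assumes a single query with stacked negatives and given weights; how the weights are estimated is model-specific and omitted.

```python
import torch
import torch.nn.functional as F

def soft_info_nce(query, positive, negatives, neg_weights, tau=0.07):
    """query/positive: (d,); negatives: (n, d); neg_weights: (n,)."""
    q = F.normalize(query, dim=0)
    p = F.normalize(positive, dim=0)
    n = F.normalize(negatives, dim=1)
    pos = torch.exp(q @ p / tau)
    neg = (neg_weights * torch.exp(n @ q / tau)).sum()  # weighted negatives
    return -torch.log(pos / (pos + neg))
```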
- Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis [2.9398911304923447]
This paper specifically aims to distinguish code generated by ChatGPT from that authored by humans.
We devise a dataset cleansing technique, which employs temporal and spatial segmentation, to mitigate the dearth of datasets.
To further enrich data resources, we employ "code transformation," "feature transformation," and "feature customization" techniques, generating an extensive dataset comprising 10,000 lines of ChatGPT-generated code.
arXiv Detail & Related papers (2023-06-26T03:15:06Z)
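A toy illustration of the kind of discernable stylistic features such a classifier can build on; the feature set here is an assumption, not the paper's.

```python
def style_features(code: str) -> dict:
    """Cheap surface features that often differ between human- and
    LLM-authored code (illustrative only)."""
    lines = code.splitlines() or [""]
    return {
        "avg_line_len": sum(map(len, lines)) / len(lines),
        "comment_ratio": sum(l.lstrip().startswith("#") for l in lines) / len(lines),
        "blank_ratio": sum(not l.strip() for l in lines) / len(lines),
        "max_indent": max(len(l) - len(l.lstrip(" ")) for l in lines),
    }
```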
- Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
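A toy version of the latent-code idea: condense a dataset into a few learnable codes plus a shared decoder. The moment-matching objective below is a deliberate simplification of the paper's condensation loss.

```python
import torch
import torch.nn as nn

n_codes, code_dim, data_dim = 10, 16, 784
codes = nn.Parameter(torch.randn(n_codes, code_dim))     # learnable codes
decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                        nn.Linear(128, data_dim))        # shared generator
opt = torch.optim.Adam([codes, *decoder.parameters()], lr=1e-3)

real_batch = torch.randn(64, data_dim)                   # stand-in for real data
for _ in range(200):
    synth = decoder(codes)
    loss = ((synth.mean(0) - real_batch.mean(0)) ** 2).mean() \
         + ((synth.std(0) - real_batch.std(0)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```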
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph, updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
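A minimal building block behind such methods: an auto-encoder with a sign-binarized bottleneck, trained via a straight-through estimator. The twin bottlenecks and the code-driven graph update are omitted here for brevity.

```python
import torch
import torch.nn as nn

class BinaryBottleneckAE(nn.Module):
    def __init__(self, in_dim=784, code_bits=32):
        super().__init__()
        self.enc = nn.Linear(in_dim, code_bits)
        self.dec = nn.Linear(code_bits, in_dim)

    def forward(self, x):
        z = torch.tanh(self.enc(x))
        b = z + (torch.sign(z) - z).detach()  # straight-through binarization
        return self.dec(b), b                 # reconstruction + binary code
```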
This list is automatically generated from the titles and abstracts of the papers on this site.