Related papers: Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining

Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining

URL: http://arxiv.org/abs/2409.02326v1
Date: Tue, 3 Sep 2024 22:36:42 GMT
Title: Arctic-SnowCoder: Demystifying High-Quality Data in Code Pretraining
Authors: Yuxiang Wei, Hojae Han, Rajhans Samdani,
Abstract summary: Arctic-SnowCoder-1.3B is a data-efficient base code model pretrained on 555B tokens. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens.
Score: 3.8608102686867762
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent studies have been increasingly demonstrating that high-quality data is crucial for effective pretraining of language models. However, the precise definition of "high-quality" remains underexplored. Focusing on the code domain, we introduce Arctic-SnowCoder-1.3B, a data-efficient base code model pretrained on 555B tokens through three phases of progressively refined data: (1) general pretraining with 500B standard-quality code tokens, preprocessed through basic filtering, deduplication, and decontamination, (2) continued pretraining with 50B high-quality tokens, selected from phase one by a BERT-style quality annotator trained to distinguish good code from random data, using positive examples drawn from high-quality code files, along with instruction data from Magicoder and StarCoder2-Instruct, and (3) enhanced pretraining with 5B synthetic data created by Llama-3.1-70B using phase two data as seeds, adapting the Magicoder approach for pretraining. Despite being trained on a limited dataset, Arctic-SnowCoder achieves state-of-the-art performance on BigCodeBench, a coding benchmark focusing on practical and challenging programming tasks, compared to similarly sized models trained on no more than 1T tokens, outperforming Phi-1.5-1.3B by 36%. Across all evaluated benchmarks, Arctic-SnowCoder-1.3B beats StarCoderBase-3B pretrained on 1T tokens. Additionally, it matches the performance of leading small base code models trained on trillions of tokens. For example, Arctic-SnowCoder-1.3B surpasses StarCoder2-3B, pretrained on over 3.3T tokens, on HumanEval+, a benchmark that evaluates function-level code generation, and remains competitive on BigCodeBench. Our evaluation presents a comprehensive analysis justifying various design choices for Arctic-SnowCoder. Most importantly, we find that the key to high-quality data is its alignment with the distribution of downstream applications.

Related papers

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data. It comprises question-solution-test triplets that are systematically validated via a self-verification procedure. This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z)
UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through textbfOutcome textbfREwtextbfArd-based reinforcement textbfLearning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
ACECODER: Acing Coder RL via Automated Test-Case Synthesis [36.740393665032954]
We design a pipeline that generates extensive (question, test-cases) pairs from existing code data. We construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. We show that our RL training can improve model on HumanEval-plus by over 25% and MBPP-plus by 6% for merely 80 optimization steps.
arXiv Detail & Related papers (2025-02-03T18:46:04Z)
Let Me DeCode You: Decoder Conditioning with Tabular Data [0.15487122608774898]
We introduce a novel approach, DeCode, that utilizes label-derived features for model conditioning to support the decoder in the reconstruction process dynamically. DeCode focuses on improving 3D segmentation performance through the incorporation of conditioning embedding with learned numerical representation of 3D-label shape features. Our results show that DeCode significantly outperforms traditional, unconditioned models in terms of generalization to unseen data, achieving higher accuracy at a reduced computational cost.
arXiv Detail & Related papers (2024-07-12T17:14:33Z)
Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels. By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data. The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
How to get better embeddings with code pre-trained models? An empirical study [6.220333404184779]
We study five different code pre-trained models (PTMs) to generate embeddings for downstream classification tasks. We find that embeddings obtained through special tokens do not sufficiently aggregate the semantic information of the entire code snippet. The quality of code embeddings obtained by combing code data and text data in the same way as pre-training the PTMs is poor and cannot guarantee richer semantic information.
arXiv Detail & Related papers (2023-11-14T10:44:21Z)
Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis [16.88062487980405]
We present a novel approach to automatically obtain data consisting of function signatures and associated Unit Tests. We show that it, in conjunction with automatically generated training data, leads to improvement of a pre-trained code language model's performance.
arXiv Detail & Related papers (2023-10-20T17:13:16Z)
The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification [7.205265729540538]
Training deep NLP models requires significant computational resources. We propose a generic approach, EarlyBIRD, to build composite representations of code from the early layers of a pre-trained transformer model.
arXiv Detail & Related papers (2023-05-08T16:47:28Z)
Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost. We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning. We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
arXiv Detail & Related papers (2023-04-11T13:34:13Z)
PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user. We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning. During inference, we introduce a new generation procedure with a critical sampling strategy. For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z)
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph. It is updated by decoding in the context of an auto-encoder. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.