Fugu-MT Paper Translation (Abstract): SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs

Paper abstract: SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs

arxiv url: http://arxiv.org/abs/2509.07858v1
Date: Tue, 09 Sep 2025 15:38:44 GMT
Status: Translation complete
System update date: 2025-09-10 14:38:27.381832
Title: SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs
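Title (reference translation): SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs
Authors: Xinyu Zhang, Changzhi Zhou, Linmei Hu, Luhao Zhang, Xiancai Chen, Haomin Fu, Yang Yang, Mengdi Zhang
Abstract summary: Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning. This paper proposes a novel iterative self-distillation approach that bootstraps small-scale LLMs and transforms them into powerful synthesizers. The authors develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder.
Reference score (attention score computed independently by the site): 16.273922496570155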
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.
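Abstract (reference translation): Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. This paper explores the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for constructing high-quality code instruction data. We first observe that the data synthesis capability of small-scale LLMs can be improved by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach that bootstraps small-scale LLMs and transforms them into powerful synthesizers, reducing reliance on proprietary LLMs and minimizing costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of the proposed method.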
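The abstract mentions gradient-based influence estimation for final data filtering but does not specify the estimator. The following is only an illustrative sketch of one common first-order variant, assumed here and not taken from the paper: score each candidate sample by the dot product between its loss gradient and the mean loss gradient over a small reference set, then keep the top-k. A toy linear model with squared loss stands in for the LLM, and every name (`squared_loss_grad`, `influence_scores`, `select_top_k`) is hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Example = Tuple[Vector, float]  # (features, target); toy stand-in for a training sample


def squared_loss_grad(w: Sequence[float], example: Example) -> Vector:
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    x, y = example
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def mean_grad(w: Sequence[float], examples: Sequence[Example]) -> Vector:
    """Average per-example gradient over a set of examples."""
    grads = [squared_loss_grad(w, ex) for ex in examples]
    return [sum(g[i] for g in grads) / len(grads) for i in range(len(w))]


def influence_scores(
    w: Sequence[float],
    candidates: Sequence[Example],
    reference: Sequence[Example],
) -> List[float]:
    """Assumed first-order influence: <grad(candidate), mean grad(reference)>.

    A positive score means a gradient step on the candidate would, to first
    order, also reduce the loss on the reference set.
    """
    ref = mean_grad(w, reference)
    return [
        sum(gi * ri for gi, ri in zip(squared_loss_grad(w, c), ref))
        for c in candidates
    ]


def select_top_k(
    w: Sequence[float],
    candidates: Sequence[Example],
    reference: Sequence[Example],
    k: int,
) -> List[int]:
    """Return indices of the k candidates with the highest influence score."""
    scores = influence_scores(w, candidates, reference)
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

In this sketch, a candidate whose gradient points the same way as the reference gradient ranks first, an orthogonal one scores zero, and a conflicting one scores negative. A real pipeline would compute gradients of the language-model loss, typically in a reduced-dimension space, rather than on a toy regression.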

Related papers