Fugu-MT Paper Translation (Abstract): Seed-Coder: Let the Code Model Curate Data for Itself

Paper overview: Seed-Coder: Let the Code Model Curate Data for Itself

arxiv url: http://arxiv.org/abs/2506.03524v1
Date: Wed, 04 Jun 2025 03:17:19 GMT
Status: Translation complete
Last updated in system: 2025-06-05 21:20:14.127108
Title: Seed-Coder: Let the Code Model Curate Data for Itself
Title (reference translation): Seed-Coder: The Code Model Curates Data for Itself
Authors: Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu
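Abstract summary: We introduce Seed-Coder, a series of open-source models of 8B size. Our code pretraining data is produced by a model-centric data pipeline. Seed-Coder achieves state-of-the-art results among open-source models of similar size.
Reference score (independently computed attention metric): 42.12340347245302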
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often rely heavily on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.
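Abstract (reference translation): Code data in large language model (LLM) pretraining is recognized as important not only for code-related tasks but also for improving the general intelligence of LLMs. Current open-source LLMs often depend heavily on human effort to build their code pretraining data, for example by using hand-crafted filtering rules tailored to individual programming languages or by training quality filters on human-annotated data. However, these approaches are inherently limited in scalability, prone to subjective bias, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs of 8B size comprising base, instruct, and reasoning models, which minimizes human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline that mainly uses LLMs to score and filter code data. The instruct model is further trained with supervised fine-tuning and preference optimization, and the reasoning model uses Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, showing strong performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.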
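The abstract says only that the pipeline predominantly uses LLMs to score and filter code data; it gives no further detail. The following is a minimal sketch of that score-then-filter idea, not the authors' implementation. All names here (`CodeSample`, `filter_by_score`, `toy_scorer`) and the threshold value are hypothetical, and `toy_scorer` is a trivial stand-in for where an LLM-based quality scorer would go.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CodeSample:
    """One candidate file for the pretraining corpus (hypothetical type)."""
    path: str
    text: str


def filter_by_score(
    samples: Iterable[CodeSample],
    scorer: Callable[[str], float],
    threshold: float,
) -> List[CodeSample]:
    """Keep only the samples whose score is at least `threshold`."""
    return [s for s in samples if scorer(s.text) >= threshold]


def toy_scorer(text: str) -> float:
    """Stand-in for an LLM quality scorer (assumption, not from the paper).

    Scores by the count of non-blank lines, capped at 1.0, so that
    empty files score 0.0 and files with three or more lines score 1.0.
    """
    non_blank = [line for line in text.splitlines() if line.strip()]
    return min(1.0, len(non_blank) / 3)


samples = [
    CodeSample("a.py", "def f():\n    return 1\n\nprint(f())\n"),
    CodeSample("b.py", "\n\n"),
]
kept = filter_by_score(samples, toy_scorer, threshold=0.5)
print([s.path for s in kept])
```

Passing the scorer in as a callable keeps the filtering step independent of how scores are produced, so the toy heuristic could be swapped for a model-backed scorer without changing `filter_by_score`.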

Related papers