Fugu-MT paper translation (abstract): Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

Paper summary: Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

arxiv url: http://arxiv.org/abs/2510.23208v1
Date: Mon, 27 Oct 2025 10:54:25 GMT
Status: translation complete
Last updated in system: 2025-10-28 15:28:15.52669
Title: Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks
Authors: Amal Abed, Ivan Lukic, Jörg K. H. Franke, Frank Hutter
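Abstract summary: Large language models (LLMs) have shown impressive promise in code generation. The authors present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets.
Reference score (attention score computed independently by the site): 41.75017840131367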
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources pair problems with solutions, but omit the intermediate thought process that guides coding. To close this gap, we present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets. Each sample combines a task, a step-by-step reasoning trace, a working solution, and executable tests, enabling models to learn not just the what but also the how of problem solving. Our pipeline combines four key components: curated contest problems, web-mined content filtered by relevance classifiers, data expansion guided by reasoning patterns, and multi-stage execution-based validation. A genetic mutation algorithm further increases task diversity while maintaining consistency between reasoning traces and code implementations. Our key finding is that fine-tuning LLMs on this dataset yields consistent improvements on coding benchmarks. Beyond raw accuracy, reasoning-aware data can substitute for model scaling, generalize across architectures, and outperform leading open-source alternatives under identical sample budgets. Our work establishes reasoning-centered synthetic data generation as an efficient approach for advancing coding capabilities in LLMs. We publish our dataset and generation pipeline to facilitate further research.
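As a rough illustration of the quadruplet format and the execution-based validation the abstract mentions, the following minimal Python sketch may help. It is not the authors' pipeline: the `Quadruplet` class, the `passes_tests` helper, and the sample task are hypothetical names and data introduced here for illustration, and the abstract's "multi-stage" validation is reduced to a single run of the tests against the solution.

```python
from dataclasses import dataclass


@dataclass
class Quadruplet:
    """One sample: task, reasoning trace, solution, executable tests."""
    instruction: str
    reasoning: str
    code: str
    tests: str


def passes_tests(sample: Quadruplet) -> bool:
    """Run the solution, then its tests, in one fresh namespace.

    Hypothetical helper: a real pipeline would sandbox execution
    and enforce timeouts instead of calling exec() directly.
    """
    namespace: dict = {}
    try:
        exec(sample.code, namespace)   # define the candidate solution
        exec(sample.tests, namespace)  # assertions raise on failure
    except Exception:
        return False
    return True


# Illustrative sample data (not taken from the paper's dataset).
good = Quadruplet(
    instruction="Return the sum of a list of integers.",
    reasoning="Iterate once and accumulate; the empty list sums to 0.",
    code="def total(xs):\n    return sum(xs)\n",
    tests="assert total([1, 2, 3]) == 6\nassert total([]) == 0\n",
)
# Same task and tests, but a solution that does not match them.
bad = Quadruplet(good.instruction, good.reasoning,
                 "def total(xs):\n    return len(xs)\n", good.tests)

# Keep only samples whose code passes its own tests.
kept = [s for s in (good, bad) if passes_tests(s)]
```

Filtering on executed tests is what lets a generator discard samples whose code and tests disagree, which is the property the abstract attributes to its validation stage.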


Related papers