Fugu-MT Paper Translation (Summary): Chain of Execution Supervision Promotes General Reasoning in Large Language Models

Paper summary: Chain of Execution Supervision Promotes General Reasoning in Large Language Models

arxiv url: http://arxiv.org/abs/2510.23629v1
Date: Fri, 24 Oct 2025 02:21:11 GMT
Status: Translation complete
Last updated in system: 2025-10-29 15:35:36.296305
Title: Chain of Execution Supervision Promotes General Reasoning in Large Language Models
Title (reference translation): A chain of execution supervision promotes general reasoning in large language models
Authors: Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu,
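Abstract summary: TracePile is a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step, chain-of-thought-style rationales. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Notably, TracePile improves LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU.
Reference score (independently computed attention score): 48.100128916029064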
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
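Abstract (reference translation): Building robust, general reasoning ability is a central goal in the development of large language models (LLMs). Recent work increasingly turns to code as a rich training source because of its inherent logical structure and its diverse reasoning paradigms, such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and is entangled with syntactic and implementation noise, so training directly on raw code is suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that converts code execution into explicit, step-by-step chain-of-thought-style rationales, called Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms, and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to increase logical granularity and code diversity. We evaluate TracePile using three training setups: continued pretraining, instruction tuning after pretraining, and two-stage fine-tuning. Experiments with four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms show consistent improvements. Notably, TracePile improves LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.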
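
The abstract describes turning code execution into explicit, step-by-step rationales but does not show what such a trace looks like. The following is a minimal illustrative sketch only, not the TracePile pipeline: it assumes a plain Python function as input and uses the standard `sys.settrace` hook to record local variables before each line runs. The names `trace_execution`, `render_rationale`, and the `gcd` example are hypothetical and do not come from the paper.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args), recording (line offset, locals) before each line executes.

    Illustrative assumption: only lines of `func` itself are recorded; calls
    into other functions are ignored.
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames other than func's own
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        elif event == "return":
            steps.append(("return", arg))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, steps


def render_rationale(steps):
    """Format recorded steps as a plain-text, step-by-step trace."""
    lines = []
    for i, (where, state) in enumerate(steps, start=1):
        if where == "return":
            lines.append(f"Step {i}: return {state!r}")
        else:
            shown = ", ".join(f"{k}={v!r}" for k, v in state.items())
            lines.append(f"Step {i}: about to run line +{where} with {shown}")
    return "\n".join(lines)


def gcd(a, b):
    # Euclid's algorithm, used here only as a small traced example.
    while b:
        a, b = b, a % b
    return a


result, steps = trace_execution(gcd, 12, 18)
print(render_rationale(steps))
```

Each printed step pairs a source line with the variable state at that point. A variable-tracing question of the kind the abstract mentions could, under the same assumptions, be formed by asking for the value of `a` or `b` at a given step.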

Related papers