Fugu-MT Paper Translation (Abstract): Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning

Paper summary: Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.05134v1
Date: Mon, 06 Apr 2026 19:53:39 GMT
Status: Translation complete
Last updated in system: 2026-04-08 17:42:09.470325
Title: Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning
Authors: Lucas Dionisopoulos, Nicklas Majamaki, Prithviraj Ammanabrolu
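Abstract summary: We analyze how a set of theoretically inspired datasets affects language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect.
Reference score (independently computed attention metric): 7.920254637344918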
License: http://creativecommons.org/licenses/by/4.0/
Abstract: How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess. We find that fine-tuning a model to directly predict the best move leads to effective RL and the strongest downstream performance -- however, the RL step elicits unfaithful reasoning (reasoning inconsistent with the chosen move). Alternatively, training on multi-move trajectories yields comparable downstream performance with faithful reasoning and more stable RL. We show that RL induces a substantial positive shift in the distribution of move quality and reduces hallucination rates as a side effect. Finally, we find several SFT-checkpoint metrics -- metrics spanning evaluation performance, hallucination rates, and reasoning quality -- to be predictive of post-RL model performance. We release checkpoints and final models as well as training data, evaluations, and code which allowed us to surpass leading open-source reasoning models in chess with a 7B-parameter model.


Related papers