Fugu-MT Paper Translation (Abstract): Reasoning Pattern Matters: Learning to Reason without Human Rationales

Paper summary: Reasoning Pattern Matters: Learning to Reason without Human Rationales

arxiv url: http://arxiv.org/abs/2510.12643v1
Date: Tue, 14 Oct 2025 15:34:38 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.37754
Title: Reasoning Pattern Matters: Learning to Reason without Human Rationales
Title (reference translation): Reasoning Patterns Matter: Learning to Reason without Human Rationales
Authors: Chaoxu Pang, Yixuan Cao, Ping Luo
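Abstract summary: Large language models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm. This paper investigates how rationale annotation costs can be substantially reduced without compromising reasoning performance.
Reference score (independently computed attention score): 27.684703630371043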
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without golden rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, where reasoning follows a fixed, procedural strategy consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks primarily stems from its ability to enable models to internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence showing that reasoning patterns rather than the quantity or quality of rationales are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve comparable SFT+RLVR performance to human rationales that are 10 times larger. These results suggest that large-scale human rationale annotations can be replaced with LLM-based automatic annotations requiring only limited human supervision over reasoning patterns.
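Abstract (reference translation): Large language models (LLMs) have shown remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first applies Supervised Fine-Tuning (SFT) to human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, and then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without gold rationales. However, annotating high-quality rationales for the SFT stage is prohibitively expensive. This paper investigates how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, called patterned reasoning tasks, in which reasoning follows a fixed, procedural strategy that is consistent across instances. Instances differ in content such as domain knowledge, factual information, and numeric values, but the solution is derived by applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks stems mainly from its ability to let models internalize these reasoning patterns. Using numerical semantic matching as a representative task, we present causal and behavioral evidence that reasoning patterns, rather than the quantity or quality of rationales, are the key determinant of performance. Building on these findings, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO). Experiments show that PARO-generated rationales achieve SFT+RLVR performance comparable to that of a human rationale set 10 times larger. These results suggest that large-scale human rationale annotation can be replaced with LLM-based automatic annotation that requires only limited human supervision over reasoning patterns.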

Related papers