Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
- URL: http://arxiv.org/abs/2601.14249v1
- Date: Tue, 20 Jan 2026 18:58:10 GMT
- Title: Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment
- Authors: Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: Rank-Surprisal Ratio is a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. We demonstrate its practical utility in both trajectory selection and teacher selection.
- Score: 82.00769536768509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that closely align with the model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength and behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training performance (average Spearman 0.86), outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
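The abstract defines RSR as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model. A minimal sketch of that computation is below; the function name, input layout, and tie-handling convention are our own assumptions, since the abstract does not specify implementation details:

```python
import math

def rank_surprisal_ratio(step_logprobs, token_ids):
    """RSR = (average token-wise rank) / (average negative log-likelihood).

    step_logprobs: for each position in the trajectory, the student model's
        log-probability distribution over the vocabulary (a list of floats).
    token_ids: the trajectory's actual token id at each position.
    """
    ranks, nlls = [], []
    for logprobs, tok in zip(step_logprobs, token_ids):
        # Rank of the observed token under the student (1 = most likely).
        ranks.append(1 + sum(lp > logprobs[tok] for lp in logprobs))
        # Token-wise negative log-likelihood (surprisal).
        nlls.append(-logprobs[tok])
    mean_rank = sum(ranks) / len(ranks)
    mean_nll = sum(nlls) / len(nlls)
    return mean_rank / mean_nll
```

On this reading, a trajectory whose tokens are improbable (high NLL) yet still relatively high-ranked under the student yields a low ratio of rank to surprisal, matching the abstract's "low absolute probability with relatively high-ranked tokens" description.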
Related papers
- Reinforcement-aware Knowledge Distillation for LLM Reasoning [63.53679456364683]
Reinforcement learning (RL) post-training has recently driven gains in long chain-of-thought reasoning large language models (LLMs). Most existing knowledge distillation methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. We propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update.
arXiv Detail & Related papers (2026-02-26T00:20:39Z) - Beyond Correctness: Learning Robust Reasoning via Transfer [51.403609251508904]
We adopt a simple philosophical view: robust reasoning should remain useful beyond the mind that produced it. We introduce Reinforcement Learning with Transferable Reward, which operationalizes robustness via a transfer reward. Our approach improves sampling consistency while improving final answer accuracy, and it reaches comparable performance in substantially fewer training steps.
arXiv Detail & Related papers (2026-02-09T10:41:44Z) - REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency [0.0]
We introduce REDistill, a principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power divergence loss, a generalization of KL divergence. Experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy across diverse teacher-student architectures.
arXiv Detail & Related papers (2026-02-04T15:50:53Z) - RAPTOR: Ridge-Adaptive Logistic Probes [37.64383880338739]
We propose RAPTOR, a simple L2-regularized logistic probe with validation-tuned ridge strength. RAPTOR matches or exceeds strong baselines in accuracy while achieving competitive directional stability. We provide a mechanistic characterization of ridge logistic regression in an idealized Gaussian teacher-student model in the high-dimensional few-shot regime.
arXiv Detail & Related papers (2026-01-29T19:20:27Z) - "The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework [16.96094045628127]
Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs). We introduce COMPACT, a framework that adaptively fuses supervision from different teachers by dynamically weighting teacher gradients.
arXiv Detail & Related papers (2026-01-20T14:05:19Z) - Long-Chain Reasoning Distillation via Adaptive Prefix Alignment [57.130176131042965]
We propose a framework that exploits teacher CoTs for distillation through adaptive prefix alignment. P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%.
arXiv Detail & Related papers (2026-01-15T04:40:45Z) - PACR: Progressively Ascending Confidence Reward for LLM Reasoning [55.06373646059141]
We propose Progressively Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed directly from the model's evolving belief in the correct answer. Our results suggest that dense, model-intrinsic shaping signals can make RLVR training more effective and reliable.
arXiv Detail & Related papers (2025-10-25T11:25:35Z) - Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory? [13.980638430366625]
Reasoning LLMs are trained to verbalize their reasoning process, yielding strong gains on complex tasks. A key prerequisite is the ability to assess the usefulness of, and build on, another model's partial thinking. This paper investigates the question: can standard solo-reasoning training pipelines deliver the desired off-trajectory behaviors?
arXiv Detail & Related papers (2025-10-07T19:42:50Z) - Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models [50.84995206660551]
We introduce Conditional advANtage estimatiON (CANON) to amplify the impact of a target metric without presuming its direction. CANON based on entropy consistently outperforms prior methods on both math reasoning and high-complexity logic tasks.
arXiv Detail & Related papers (2025-09-28T16:33:07Z) - In Their Own Words: Reasoning Traces Tailored for Small Models Make Them Better Reasoners [12.995634497832027]
Transferring reasoning capabilities from larger language models to smaller ones often fails counterintuitively. We identify that this failure stems from distributional misalignment: reasoning traces from larger models contain tokens that are low probability under the student's distribution. We propose Reverse Speculative Decoding (RSD), a mechanism for generating student-friendly reasoning traces.
arXiv Detail & Related papers (2025-09-26T11:40:32Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or L2 distance between their intermediate features.
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
arXiv Detail & Related papers (2023-05-26T15:05:19Z) - SCOTT: Self-Consistent Chain-of-Thought Distillation [68.40232422158569]
Large language models (LMs) generate free-text rationales for their predictions via chain-of-thought prompting.
We propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger.
To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective.
arXiv Detail & Related papers (2023-05-03T03:47:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including the generated summaries) and is not responsible for any consequences of its use.