"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework
- URL: http://arxiv.org/abs/2601.13992v1
- Date: Tue, 20 Jan 2026 14:05:19 GMT
- Title: "The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework
- Authors: Jin Cui, Jiaqi Guo, Jiepeng Zhou, Ruixuan Yang, Jiayi Lu, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren,
- Abstract summary: Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Small Language Models (SLMs). We introduce COMPACT, a framework that adaptively fuses supervision from different teachers by dynamically weighting teacher gradients.
- Score: 16.96094045628127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Small Language Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervision remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervision from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.
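The abstract describes weighting each teacher's gradient by a compatibility score built from three signals (consensus, adaptability, difficulty). The paper gives no code; the sketch below is a minimal illustration of that weighting idea, in which all function names and the specific combination rule (sum of the first two signals minus difficulty, passed through a softmax) are assumptions, not the paper's actual formula.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def compatibility_weights(consensus, adaptability, difficulty, temperature=1.0):
    """Fuse three per-teacher compatibility signals into gradient weights.

    consensus    -- graph-based consensus score per teacher (higher = mainstream rationale)
    adaptability -- mutual-information-based score per teacher (higher = student "gets it")
    difficulty   -- loss-based difficulty per teacher (higher = harder for the student)

    The additive combination below is an illustrative assumption.
    """
    scores = [c + a - d for c, a, d in zip(consensus, adaptability, difficulty)]
    return softmax([s / temperature for s in scores])

def fused_loss(per_teacher_losses, weights):
    """Weight each teacher's distillation loss; gradients scale the same way."""
    return sum(w * l for w, l in zip(weights, per_teacher_losses))
```

Because the weights are a softmax, they always sum to one, so a teacher judged incompatible (low consensus, high difficulty) is smoothly down-weighted rather than hard-filtered.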
Related papers
- Long-Chain Reasoning Distillation via Adaptive Prefix Alignment [57.130176131042965]
We propose a framework that exploits teacher CoTs for distillation through adaptive prefix alignment. P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%.
arXiv Detail & Related papers (2026-01-15T04:40:45Z) - MIND: From Passive Mimicry to Active Reasoning through Capability-Aware Multi-Perspective CoT Distillation [16.96094045628127]
Existing approaches restrict students to following a single golden rationale and treat different reasoning paths independently. This misalignment leads to a degeneration of the student's latent reasoning distribution, causing suboptimal performance. We propose MIND, a capability-filtered framework that transitions from passive mimicry to active cognitive construction.
arXiv Detail & Related papers (2026-01-07T09:08:59Z) - Automatic Question Generation for Intuitive Learning Utilizing Causal Graph Guided Chain of Thought Reasoning [8.587087233323038]
We propose a novel framework that combines causal-graph-guided Chain-of-Thought reasoning with a multi-agent language model. This approach ensures the generation of accurate, meaningful, and curriculum-aligned questions. Experimental results demonstrate up to a 70% improvement in quality compared to reference methods.
arXiv Detail & Related papers (2026-01-02T08:49:58Z) - From Reasoning LLMs to BERT: A Two-Stage Distillation Framework for Search Relevance [20.096802351171377]
E-commerce search systems face strict latency requirements that prevent the direct application of Large Language Models. We propose a two-stage reasoning distillation framework to transfer reasoning capabilities from a powerful teacher LLM to a lightweight, deployment-friendly student model. Our framework achieves significant improvements across multiple metrics, validating its effectiveness and practical value.
arXiv Detail & Related papers (2025-10-13T06:46:43Z) - AdaSwitch: Adaptive Switching Generation for Knowledge Distillation [58.647880811071495]
Small language models (SLMs) are crucial for applications with strict latency and computational constraints. We propose AdaSwitch, a novel approach that combines on-policy and off-policy generation at the token level. AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
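The abstract's token-level mixing of on-policy (student-generated) and off-policy (teacher-provided) tokens can be sketched as follows. The confidence-threshold switching rule here is an illustrative stand-in, not AdaSwitch's actual criterion, and all names are assumptions.

```python
def adaptive_switch_sequence(student_steps, teacher_tokens, confidence_threshold=0.5):
    """Token-level on-/off-policy mixing.

    student_steps  -- list of (student_token, student_confidence) pairs, one per position
    teacher_tokens -- off-policy reference token for each position

    Where the student is confident, keep its own (on-policy) token;
    otherwise fall back to the teacher's (off-policy) token.
    """
    mixed = []
    for (s_tok, s_conf), t_tok in zip(student_steps, teacher_tokens):
        mixed.append(s_tok if s_conf >= confidence_threshold else t_tok)
    return mixed
```

The design point is that switching happens per token rather than per sequence, so a single training trajectory can interleave student exploration with teacher correction.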
arXiv Detail & Related papers (2025-10-09T06:38:37Z) - TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning [15.638836465479619]
TRiCo is a novel triadic game-theoretic co-training framework that rethinks the structure of semi-supervised learning. By addressing key limitations in existing SSL frameworks, TRiCo provides a principled and generalizable solution.
arXiv Detail & Related papers (2025-09-25T20:10:41Z) - Merge-of-Thought Distillation [23.53356244978525]
Merge-of-Thought Distillation (MoT) is a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, applying MoT to a Qwen3-14B student surpasses strong models including DeepSeek-R1, Qwen3-32B, and OpenAI-O1. MoT consistently outperforms the best single-teacher distillation, improves general reasoning beyond mathematics, and shows robustness to distribution-shifted and peer-level teachers.
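The weight-space merging step MoT alternates with fine-tuning can be sketched as plain (weighted) parameter averaging of the teacher-specific student checkpoints. This is a minimal illustration; the function names, the dict-of-lists checkpoint format, and uniform averaging as the default are all assumptions.

```python
def merge_student_variants(state_dicts, weights=None):
    """Weight-space merge of teacher-specific student checkpoints.

    state_dicts -- list of {param_name: list_of_floats} checkpoints,
                   all sharing the same parameter names and shapes
    weights     -- optional per-checkpoint mixing weights (default: uniform)
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(state_dicts[0][name]))
        ]
    return merged
```

In MoT the merged model then seeds the next round of teacher-specific fine-tuning, so the merge acts as a consensus step rather than a one-shot ensemble.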
arXiv Detail & Related papers (2025-09-10T17:46:57Z) - Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection [64.73809794561305]
errOr-aware self-ReflectION (ORION) is a framework that refines teacher CoTs through an Error-Aware Reflection process. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines.
arXiv Detail & Related papers (2025-05-28T08:57:03Z) - Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts [81.37287967870589]
We propose to harness a diverse set of specialized teachers, instead of a single generalist one, that collectively supervise the strong student.
Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision.
We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets.
arXiv Detail & Related papers (2024-02-23T18:56:11Z) - Contrastive Knowledge Amalgamation for Unsupervised Image Classification [2.6392087010521728]
Contrastive Knowledge Amalgamation (CKA) aims to learn a compact student model to handle the joint objective from multiple teacher models.
Intra- and inter-model contrastive losses are designed to widen the distance between representations of different classes.
The alignment loss is introduced to minimize the sample-level distribution differences of teacher-student models in the common representation space.
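The two loss families described above can be illustrated with a minimal sketch: a hinge-style contrastive term that separates different-class representations, and an alignment term that pulls paired teacher/student representations together in the common space. Both concrete forms are assumptions standing in for the paper's actual losses.

```python
def contrastive_loss(anchor, other, same_class, margin=1.0):
    """Pull same-class representations together; push different classes
    at least `margin` apart (hinge form; an assumed concrete choice)."""
    d2 = sum((a - o) ** 2 for a, o in zip(anchor, other))
    if same_class:
        return d2
    return max(0.0, margin - d2 ** 0.5) ** 2

def alignment_loss(student_feats, teacher_feats):
    """Sample-level alignment: mean squared distance between paired
    student/teacher representations in the common space."""
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student_feats)
```

The contrastive term shapes the class geometry of the student's space, while the alignment term anchors that space to the teachers'; the amalgamation objective combines the two.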
arXiv Detail & Related papers (2023-07-27T11:21:14Z) - Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks are proposed to improve the robustness of NER models.
In this paper, we propose an adaptive teacher learning method comprising two teacher-student networks.
Fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, which enhances consistent predictions on each model fragment against noise.
arXiv Detail & Related papers (2022-12-13T12:14:09Z) - From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models [55.137869702763375]
This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI).
KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model.
We then design a Model Uncertainty-aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student.
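One simple instantiation of uncertainty-aware supervision recovery is to score each teacher's prediction by its entropy and, per sample, follow the least uncertain teacher. This is an illustrative stand-in for MUKI's actual mechanism; all names are assumptions.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_supervision(teacher_probs):
    """Pick, per sample, the index of the teacher whose prediction is
    least uncertain (lowest entropy).

    teacher_probs -- list (one entry per teacher) of probability vectors
                     for a single sample
    """
    entropies = [entropy(p) for p in teacher_probs]
    return entropies.index(min(entropies))
```

Since each teacher specializes in a different classification problem, low uncertainty serves as a proxy for "this sample falls in my specialty," letting the student recover a usable golden label.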
arXiv Detail & Related papers (2022-10-11T07:59:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.