Not All Correct Answers Are Equal: Why Your Distillation Source Matters
- URL: http://arxiv.org/abs/2505.14464v2
- Date: Thu, 22 May 2025 03:46:05 GMT
- Title: Not All Correct Answers Are Equal: Why Your Distillation Source Matters
- Authors: Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, Xiangang Li
- Abstract summary: Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. We collect verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench.
- Score: 16.441081996257576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token-length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The model distilled from AM-Thinking-v1 consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
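For a concrete picture of the recipe the abstract describes, here is a minimal sketch assuming Hugging Face `transformers`/`datasets`: only verified teacher traces are kept, and a student is supervised fine-tuned on them. The record fields, the toy examples, and the small student checkpoint are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of the distillation recipe from the abstract: keep only
# verified teacher traces, then supervised fine-tune a student on them.
# Record fields and the student checkpoint are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Each teacher answers the same query corpus; `verified` would come from an
# answer checker or test harness (not shown here).
TEACHER_OUTPUTS = [
    {"query": "What is 12 * 13?", "response": "<think>12*13 = 156</think>\n156", "verified": True},
    {"query": "Is 91 prime?", "response": "<think>91 = 7 * 13</think>\nNo", "verified": True},
]


def build_sft_dataset(records, tokenizer, max_len=2048):
    """Keep verified traces and pack query + response into causal-LM examples."""
    texts = [r["query"] + "\n" + r["response"] + tokenizer.eos_token
             for r in records if r["verified"]]
    ds = Dataset.from_dict({"text": texts})
    return ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=max_len),
                  batched=True, remove_columns=["text"])


if __name__ == "__main__":
    student_name = "Qwen/Qwen2.5-0.5B"  # small stand-in; the paper trains larger students
    tokenizer = AutoTokenizer.from_pretrained(student_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(student_name)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="distilled-student",
                               per_device_train_batch_size=1,
                               num_train_epochs=1, logging_steps=1),
        train_dataset=build_sft_dataset(TEACHER_OUTPUTS, tokenizer),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()
```

In a reproduction, `TEACHER_OUTPUTS` would simply be replaced with one of the released Hugging Face datasets linked above, which already contain verified traces.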
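The abstract also compares the three distilled corpora by token-length diversity and perplexity. A rough sketch of those two dataset-level statistics might look like the following; the reference scorer model and the tiny sample are assumptions for illustration, not the paper's setup.

```python
# Rough sketch of the two dataset-level statistics mentioned in the abstract:
# token-length spread of responses and their perplexity under a reference LM.
# The reference model and the sample responses are assumptions for illustration.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ref_name = "gpt2"  # any causal LM can serve as the scorer
tok = AutoTokenizer.from_pretrained(ref_name)
ref = AutoModelForCausalLM.from_pretrained(ref_name).eval()

responses = ["<think>12*13 = 156</think>\n156", "<think>91 = 7 * 13</think>\nNo"]

lengths, ppls = [], []
with torch.no_grad():
    for text in responses:
        ids = tok(text, return_tensors="pt").input_ids
        lengths.append(ids.shape[1])
        loss = ref(ids, labels=ids).loss  # mean next-token negative log-likelihood
        ppls.append(math.exp(loss.item()))

mean_len = sum(lengths) / len(lengths)
std_len = (sum((l - mean_len) ** 2 for l in lengths) / len(lengths)) ** 0.5
print(f"token-length std (diversity proxy): {std_len:.1f}")
print(f"mean perplexity: {sum(ppls) / len(ppls):.2f}")
```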
Related papers
- Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling [0.0]
Current approaches to visual question answering often struggle with the precision required for scientific data interpretation. We present our approach to the SciVQA 2025 shared task, focusing on answering visual and non-visual questions grounded in scientific figures from scholarly articles. Our findings underscore the effectiveness of prompt optimization, chain-of-thought reasoning, and ensemble modeling in improving the model's visual question answering performance.
arXiv Detail & Related papers (2025-07-08T17:05:42Z) - OpenThoughts: Data Recipes for Reasoning Models [215.16652796083164]
The OpenThoughts project creates open-source datasets for training reasoning models. The OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data. The project also releases the OpenThoughts3-7B model, which achieves state-of-the-art results.
arXiv Detail & Related papers (2025-06-04T17:25:39Z) - Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning [21.70706473875226]
We propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT); Stage 2 further refines the model using both positive and negative traces through the proposed REDI objective. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT, as well as SFT combined with DPO/SimPO, on mathematical reasoning tasks.
arXiv Detail & Related papers (2025-05-30T17:47:17Z) - Reinforcement Learning for Reasoning in Large Language Models with One Training Example [133.018487956408]
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). We identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after training accuracy has saturated.
arXiv Detail & Related papers (2025-04-29T09:24:30Z) - DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training [16.441081996257576]
Large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks. We construct a large-scale, difficulty-graded reasoning dataset containing about 3.34 million unique queries of varying difficulty levels. We significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark.
arXiv Detail & Related papers (2025-04-24T13:57:53Z) - 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training [16.441081996257576]
AM-DeepSeek-R1-Distilled is a large-scale dataset with thinking traces for general reasoning tasks. The AM-Distill-Qwen-32B model, trained with only simple Supervised Fine-Tuning (SFT), outperformed the DeepSeek-R1-Distill-Qwen-32B model on four benchmarks.
arXiv Detail & Related papers (2025-03-25T13:19:46Z) - MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address these limitations by empowering reward models (RMs) with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing [59.58984194238254]
We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization.
Unlike prior works that rely on an extreme-scale teacher model, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs.
By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs.
arXiv Detail & Related papers (2023-05-26T05:19:24Z) - AMR Parsing via Graph-Sequence Iterative Inference [62.85003739964878]
We propose a new end-to-end model that treats AMR parsing as a series of dual decisions on the input sequence and the incrementally constructed graph.
We show that the answers to these two questions are mutually dependent.
We design a model based on iterative inference that helps achieve better answers from both perspectives, leading to greatly improved parsing accuracy.
arXiv Detail & Related papers (2020-04-12T09:15:21Z)