Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series
- URL: http://arxiv.org/abs/2511.01354v1
- Date: Mon, 03 Nov 2025 09:00:51 GMT
- Title: Thinking with DistilQwen: A Tale of Four Distilled Reasoning and Reward Model Series
- Authors: Wenrui Cai, Chengyu Wang, Junbing Yan, Jun Huang, Xiangzhong Fang
- Abstract summary: We introduce four model series specifically designed to meet industrial requirements. The DistilQwen model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge.
- Score: 15.763018008675083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the demand for small and efficient reasoning models to support real-world applications has driven the development of knowledge distillation techniques that balance reasoning performance and inference speed. In this paper, we further extend the DistilQwen model family, initialized from the Qwen models, by introducing four model series specifically designed to meet industrial requirements. The distilled model collection comprises: (1) slow-thinking models, optimized for reasoning tasks that require high accuracy; (2) two series of adaptive-thinking models, which dynamically adjust reasoning strategies based on input tasks to maximize efficiency across diverse scenarios; and (3) distilled reward models, which enable further reinforcement learning of reasoning models using distilled knowledge. Comprehensive evaluations across multiple benchmarks demonstrate both high inference efficiency and strong reasoning performance for these models, as well as the practical utility of distilled reward models. We further show that these models support industry practitioners by providing scalable training and inference functionalities on the Alibaba Cloud PAI (Platform for Artificial Intelligence) platform.
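The abstract does not describe DistilQwen's training recipe in detail; as a hedged illustration only, the classic logit-based form of knowledge distillation (matching a small student's token distribution to a large teacher's) can be sketched as follows. All names here are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over temperature-softened token
    distributions -- classic logit distillation; DistilQwen's exact
    recipe is not specified in the abstract above."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t) - np.log(s))).sum(axis=-1).mean()
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2

# Toy example: 4 token positions over a 10-word vocabulary.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
student = rng.normal(size=(4, 10))
loss = distillation_loss(student, teacher)
```

Minimizing this loss pulls the student's predictive distribution toward the teacher's; a training loop would combine it with the usual cross-entropy on ground-truth tokens.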
Related papers
- A Survey on Efficient Vision-Language-Action Models [153.11669266922993]
Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. Motivated by the urgent need to address efficiency challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models.
arXiv Detail & Related papers (2025-10-27T17:57:33Z)
- The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging [8.930191971732649]
We present a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles.
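In its simplest form, model merging is a weighted average of two checkpoints' parameters. A hypothetical sketch of linear interpolation, one of the basic techniques such studies compare (the checkpoint names below are illustrative):

```python
import numpy as np

def merge_checkpoints(state_a, state_b, alpha=0.5):
    """Linear interpolation of two parameter dicts: alpha=1.0 recovers
    model A, alpha=0.0 recovers model B. One of the simplest merging
    methods; the study above evaluates several."""
    assert state_a.keys() == state_b.keys()
    return {k: alpha * state_a[k] + (1.0 - alpha) * state_b[k]
            for k in state_a}

# Hypothetical checkpoints: a thorough "slow-thinking" model and a
# token-efficient one; alpha dials the accuracy/efficiency trade-off.
slow = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
fast = {"w": np.array([3.0, 4.0]), "b": np.array([1.0])}
merged = merge_checkpoints(slow, fast, alpha=0.25)
```

Sweeping `alpha` yields the "spectrum" of intermediate models the study's title alludes to.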
arXiv Detail & Related papers (2025-09-26T08:12:13Z)
- Libra: Assessing and Improving Reward Model by Learning to Think [37.22776255575947]
We present a reasoning-oriented benchmark (Libra Bench) to address the limitations of existing reward model benchmarks in reasoning scenarios. We introduce a novel approach for improving the generative reward model via learning-to-think methodologies. We develop the Libra-RM series, a collection of generative reward models with reasoning capabilities that achieve state-of-the-art results on various benchmarks.
arXiv Detail & Related papers (2025-07-29T10:02:43Z)
- LARES: Latent Reasoning for Sequential Recommendation [96.26996622771593]
We present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation. Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity. Our framework exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.
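The core recurrence idea (deeper reasoning without more parameters) can be sketched in a toy form, assuming only what the abstract states; the block below is a stand-in, not the framework's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))  # one shared, weight-tied block

def latent_reason(h, depth):
    """Apply the same recurrent block `depth` times: reasoning depth can
    be expanded at inference while the parameter count stays fixed."""
    for _ in range(depth):
        h = np.tanh(h @ W + h)  # residual update in latent space
    return h

h0 = rng.normal(size=(1, 8))
shallow = latent_reason(h0, depth=2)   # cheap pass
deep = latent_reason(h0, depth=8)      # more compute, same parameters
```

The same matrix `W` serves every step, so increasing `depth` adds compute but no parameters.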
arXiv Detail & Related papers (2025-05-22T16:22:54Z)
- Leveraging Reasoning Model Answers to Enhance Non-Reasoning Model Capability [16.441081996257576]
We propose leveraging reasoning-intensive models to improve less computationally demanding, non-reasoning models. We demonstrate consistent improvements across various benchmarks, underscoring the potential of this approach for advancing the ability of models to answer questions directly.
arXiv Detail & Related papers (2025-04-13T16:26:56Z)
- Generative Models in Decision Making: A Survey [63.68746774576147]
Generative models can be incorporated into decision-making systems by generating trajectories that guide agents toward high-reward state-action regions or intermediate sub-goals. This paper presents a comprehensive review of the application of generative models in decision-making tasks.
arXiv Detail & Related papers (2025-02-24T12:31:28Z)
- Learning-based Models for Vulnerability Detection: An Extensive Study [3.1317409221921144]
We extensively investigate two types of state-of-the-art learning-based approaches.
We experimentally demonstrate the superiority of sequence-based models and the limited capabilities of graph-based models.
arXiv Detail & Related papers (2024-08-14T13:01:30Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Revisiting Implicit Models: Sparsity Trade-offs Capability in Weight-tied Model for Vision Tasks [4.872984658007499]
Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite layer models.
We revisit the line of implicit models and trace them back to the original weight-tied models.
Surprisingly, we observe that weight-tied models are more effective, stable, and efficient on vision tasks compared to the DEQ variants.
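The distinction can be illustrated with a toy sketch: a weight-tied model unrolls the same layer a fixed number of times, whereas a DEQ solves directly for that layer's fixed point. For a contraction, the unrolled iterate lands on the same point (all values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.05, size=(6, 6))  # small weights -> a contraction
x = rng.normal(size=6)                   # input injection

def layer(z):
    # The same W is reused at every step: this is what "weight-tied" means.
    return np.tanh(W @ z + x)

# Weight-tied model: unroll a fixed number of iterations.
z = np.zeros(6)
for _ in range(30):
    z = layer(z)

# A DEQ would instead solve z* = layer(z*) with a root finder; here the
# unrolled iterate has already converged to (numerically) the same point.
residual = float(np.linalg.norm(layer(z) - z))
```

The near-zero residual shows why, for well-behaved layers, plain unrolling can match the DEQ fixed point without the cost and instability of implicit differentiation.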
arXiv Detail & Related papers (2023-07-16T11:45:35Z)
- Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning [56.50123642237106]
Common practice in model-based reinforcement learning is to learn models that capture every aspect of the agent's environment.
We argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios.
We propose new kinds of models that only model the relevant aspects of the environment, which we call "minimal value-equivalent partial models".
arXiv Detail & Related papers (2023-01-24T16:40:01Z)
- Hybrid modeling: Applications in real-time diagnosis [64.5040763067757]
We outline a novel hybrid modeling approach that combines machine-learning-inspired models with physics-based models.
We use such models for real-time diagnosis applications.
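One common hybrid-modeling pattern, sketched here as a hypothetical example since the abstract does not detail the paper's specific combination: keep the physics-based model and fit a small data-driven correction to its residual.

```python
import numpy as np

def physics_model(t):
    """First-principles prediction, e.g. a nominal exponential decay."""
    return np.exp(-0.5 * t)

def fit_residual_correction(t, measured):
    """Fit a data-driven model (here a least-squares cubic) to whatever
    the physics model misses -- one common hybrid pattern, not
    necessarily the paper's."""
    residual = measured - physics_model(t)
    return np.poly1d(np.polyfit(t, residual, deg=3))

# Hypothetical sensor data: the real system decays faster than modeled.
t = np.linspace(0.0, 4.0, 50)
measured = np.exp(-0.7 * t)
correction = fit_residual_correction(t, measured)

def hybrid_model(t):
    # Physics prior plus learned correction.
    return physics_model(t) + correction(t)

physics_error = np.abs(physics_model(t) - measured).mean()
hybrid_error = np.abs(hybrid_model(t) - measured).mean()
```

For diagnosis, a large gap between measurement and the hybrid prediction then flags an anomaly that neither the physics prior nor the learned nominal correction explains.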
arXiv Detail & Related papers (2020-03-04T00:44:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences arising from its use.