Didactic to Constructive: Turning Expert Solutions into Learnable Reasoning
- URL: http://arxiv.org/abs/2602.02405v1
- Date: Mon, 02 Feb 2026 18:03:43 GMT
- Title: Didactic to Constructive: Turning Expert Solutions into Learnable Reasoning
- Authors: Ethan Mendes, Jungsoo Park, Alan Ritter
- Abstract summary: We propose Distribution Aligned Imitation Learning (DAIL), a two-step method that bridges the distributional gap by transforming expert solutions into detailed, in-distribution reasoning traces. We find that DAIL can leverage fewer than 1000 high-quality expert solutions to achieve 10-25% pass@k gains on Qwen2.5-Instruct and Qwen3 models, improve reasoning efficiency by 2x to 4x, and enable out-of-domain generalization.
- Score: 24.23048069764839
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model's ability to sample a correct solution to be reinforced or on the existence of a stronger model able to solve the problem. However, many difficult problems remain intractable for even current frontier models, preventing the extraction of valid training signals. A promising alternative is to leverage high-quality expert human solutions, yet naive imitation of this data fails because it is fundamentally out of distribution: expert solutions are typically didactic, containing implicit reasoning gaps intended for human readers rather than computational models. Furthermore, high-quality expert solutions are expensive, necessitating generalizable sample-efficient training methods. We propose Distribution Aligned Imitation Learning (DAIL), a two-step method that bridges the distributional gap by first transforming expert solutions into detailed, in-distribution reasoning traces and then applying a contrastive objective to focus learning on expert insights and methodologies. We find that DAIL can leverage fewer than 1000 high-quality expert solutions to achieve 10-25% pass@k gains on Qwen2.5-Instruct and Qwen3 models, improve reasoning efficiency by 2x to 4x, and enable out-of-domain generalization.
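The abstract describes DAIL's second step only at a high level ("a contrastive objective to focus learning on expert insights"). As a minimal sketch, assuming a Bradley-Terry / DPO-style pairwise loss over a preferred (expert-derived) trace and a rejected (model-sampled) trace — the paper's exact objective, function names, and the `beta` temperature below are illustrative, not taken from the paper:

```python
import math

def contrastive_loss(logp_expert, logp_model, beta=0.1):
    """Pairwise contrastive loss: push the policy's log-probability of the
    expert-derived trace above that of the model's own trace.
    Returns -log sigmoid(beta * (logp_expert - logp_model))."""
    margin = beta * (logp_expert - logp_model)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two traces are equally likely, the loss equals log 2; it shrinks as the expert-derived trace becomes relatively more probable under the policy. (Full DPO additionally normalizes by a reference model's log-probabilities, omitted here for brevity.)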
Related papers
- Calibrated Reasoning: An Explanatory Verifier for Dynamic and Efficient Problem-Solving [2.357104785442987]
We propose a pairwise Explanatory Verifier that produces calibrated confidence scores and associated natural language reasoning for generated solutions.
Our verifier improves the accuracy and efficiency of test-time strategies like best-of-n and self-reflection.
arXiv Detail & Related papers (2025-09-24T01:36:00Z)
- No Need for Learning to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction [3.746889836344766]
We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction.
Our method consistently outperforms both the standalone model and the strongest expert.
arXiv Detail & Related papers (2025-09-16T02:01:21Z)
- Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning [69.64809103333839]
We investigate how explicitly modeling a problem's difficulty prior shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning.
Our approach demonstrates significant performance gains across various multimodal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
arXiv Detail & Related papers (2025-05-19T15:43:10Z)
- Reasoning Paths Optimization: Learning to Reason and Explore From Diverse Paths [69.39559168050923]
We introduce Reasoning Paths Optimization (RPO), which enables learning to reason and explore from diverse paths.
Our approach encourages favorable branches at each reasoning step while penalizing unfavorable ones, enhancing the model's overall problem-solving performance.
We focus on multi-step reasoning tasks, such as math word problems and science-based exam questions.
arXiv Detail & Related papers (2024-10-07T06:37:25Z)
- Learning Constrained Optimization with Deep Augmented Lagrangian Methods [54.22290715244502]
A machine learning (ML) model is trained to emulate a constrained optimization solver.
This paper proposes an alternative approach, in which the ML model is trained to predict dual solution estimates directly.
It enables an end-to-end training scheme in which the dual objective serves as the loss function and solution estimates are iteratively refined toward primal feasibility, emulating a Dual Ascent method.
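The Dual Ascent iteration that the learned dual predictor emulates can be shown on a toy problem. This is not the paper's method (which trains an ML model to predict the duals); it is the classical loop on an illustrative problem, minimize ½x² subject to x = 1, where the primal minimizer has a closed form:

```python
def dual_ascent(alpha=0.5, steps=100):
    """Classical dual ascent on: minimize 1/2 x^2  subject to  x = 1.
    Lagrangian: L(x, lam) = 1/2 x^2 + lam * (x - 1).
    Primal step: x = argmin_x L(x, lam) = -lam   (closed form here).
    Dual step:   lam += alpha * (x - 1)          (dual gradient = constraint residual)."""
    lam = 0.0
    for _ in range(steps):
        x = -lam                  # primal update
        lam += alpha * (x - 1.0)  # ascend the dual using the residual
    return x, lam
```

The residual (x - 1) drives the multiplier toward lam = -1, at which point the primal iterate satisfies the constraint x = 1; the deep-learning variant replaces the dual step with a network's prediction.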
arXiv Detail & Related papers (2024-03-06T04:43:22Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes [4.19968291791323]
We use cognitive task analysis to translate an expert's latent thought process into a decision-making model for remediation.
This involves an expert identifying (A) the student's error, (B) a remediation strategy, and (C) their intention before generating a response.
We construct a dataset of 700 real tutoring conversations, annotated by experts with their decisions.
arXiv Detail & Related papers (2023-10-16T17:59:50Z)
- Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems.
We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately.
Our model's network parameters are reduced to only 37% of the state-of-the-art baseline's, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
- A Mutual Information Maximization Approach for the Spurious Solution Problem in Weakly Supervised Question Answering [60.768146126094955]
Weakly supervised question answering usually has only the final answers as supervision signals.
There may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance.
We propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions.
arXiv Detail & Related papers (2021-06-14T05:47:41Z)
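Mutual information objectives of this kind are commonly trained through an InfoNCE-style lower bound: the true (question-answer, solution) pair is scored against spurious candidates, and the loss is the negative log softmax probability of the true pair. This is a generic sketch, not necessarily the estimator that paper uses, and the similarity logits below are a stand-in for a learned scoring function:

```python
import math

def info_nce_loss(scores, positive_idx=0):
    """InfoNCE-style objective: -log softmax probability of the true
    (question-answer, solution) pair among candidate solutions.
    `scores` are similarity logits; higher means more compatible."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_idx]
```

Minimizing this loss raises the true pair's score relative to the spurious candidates, which is how the MI bound penalizes solutions that reach the right answer for the wrong reasons.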
This list is automatically generated from the titles and abstracts of the papers on this site.