Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks
- URL: http://arxiv.org/abs/2601.13244v1
- Date: Mon, 19 Jan 2026 17:26:49 GMT
- Title: Do Instruction-Tuned Models Always Perform Better Than Base Models? Evidence from Math and Domain-Shifted Benchmarks
- Authors: Prateek Munjal, Clement Christophe, Ronnie Rajan, Praveenkumar Kanithi,
- Abstract summary: We evaluate base and instruction-tuned models on standard math benchmarks, structurally perturbed variants, and domain-shifted tasks. Our results show that base models surpass instruction-tuned variants on the domain-specific MedCalc benchmark. Instruction-tuned models show sharp declines on perturbed datasets, indicating sensitivity to prompt structure over robust reasoning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction finetuning is standard practice for improving LLM performance, yet it remains unclear whether it enhances reasoning or merely induces surface-level pattern matching. We investigate this by evaluating base and instruction-tuned models on standard math benchmarks, structurally perturbed variants, and domain-shifted tasks. Our analysis highlights two key (often overlooked) limitations of instruction tuning. First, the performance advantage is unstable and depends heavily on evaluation settings. In zero-shot CoT settings on GSM8K, base models consistently outperform instruction-tuned variants, with drops as high as 32.67% (Llama3-70B). Instruction-tuned models only match or exceed this performance when provided with few-shot exemplars, suggesting a reliance on specific prompting patterns rather than intrinsic reasoning. Second, tuning gains are brittle under distribution shift. Our results show that base models surpass instruction-tuned variants on the domain-specific MedCalc benchmark. Additionally, instruction-tuned models show sharp declines on perturbed datasets, indicating sensitivity to prompt structure over robust reasoning.
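The two evaluation settings contrasted in the abstract (zero-shot CoT versus few-shot exemplars) and the base-versus-instruct comparison can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration, not the authors' harness: the prompt templates, the "Let's think step by step." trigger phrase, exact-match scoring, and the helper names are all hypothetical. The abstract also does not say whether the reported 32.67% drop is absolute or relative; the sketch computes an absolute gap in percentage points.

```python
from typing import Sequence, Tuple

# Assumption: a commonly used zero-shot CoT trigger, not necessarily the paper's.
ZERO_SHOT_COT_TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Build a prompt with no exemplars, only a reasoning trigger."""
    return f"Question: {question}\nAnswer: {ZERO_SHOT_COT_TRIGGER}"


def few_shot_prompt(question: str, exemplars: Sequence[Tuple[str, str]]) -> str:
    """Build a prompt that prepends worked (question, answer) exemplars."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in exemplars)
    return f"{shots}\n\nQuestion: {question}\nAnswer:"


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy in percent (a simplifying assumption)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def accuracy_gap(base_acc: float, instruct_acc: float) -> float:
    """Absolute gap in percentage points; positive means the base model wins."""
    return base_acc - instruct_acc
```

In use, one would score the same question set under each prompt style for both the base and the instruction-tuned model, then compare the gaps per setting; a gap that flips sign between zero-shot and few-shot prompts would match the instability the abstract describes.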
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z) - On the Effect of Instruction Tuning Loss on Generalization [22.288479270814484]
We show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings.
arXiv Detail & Related papers (2025-07-10T14:46:33Z) - Shadow-FT: Tuning Instruct Model via Training on Paired Base Model [67.20706292627106]
Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. We propose a novel Shadow-FT framework to tune the Instruct models by leveraging the corresponding Base models. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance.
arXiv Detail & Related papers (2025-05-19T05:16:21Z) - Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs [63.36637269634553]
We introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step. We show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales. Our work is also significant because both quantitative analyses and manual evaluations reveal the observed gains stem from the models' ability to refine an initial reasoning chain.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [75.25114727856861]
Large language models (LLMs) tend to suffer from deterioration at the latter stage of the supervised fine-tuning process.
We introduce a simple disperse-then-merge framework to address the issue.
Our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks.
arXiv Detail & Related papers (2024-05-22T08:18:19Z) - Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning [0.0]
We introduce the Instruction Following Score (IFS), a metric that detects language models' ability to follow instructions.
We benchmark publicly available base and instruct models, and show that the ratio of well formatted responses to partial and full sentences can be an effective measure.
We compute IFS for Supervised Fine-Tuning (SFT) of 7B and 13B LLaMA models, showing that models learn to follow instructions relatively early in the training process.
arXiv Detail & Related papers (2023-07-05T09:42:25Z) - Evaluating the Zero-shot Robustness of Instruction-tuned Language Models [23.488398944358643]
We find that using novel (unobserved) but appropriate instruction phrasings consistently degrades model performance.
We propose a simple method to mitigate this issue by introducing "soft prompt" embedding parameters.
We show that this method consistently improves the robustness of instruction-tuned models.
arXiv Detail & Related papers (2023-06-20T03:48:51Z) - Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z) - On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets [101.28658250723804]
This paper experiments with augmenting a transformer model with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action.
We observe that the proposed modules resulted in improved, and in fact state-of-the-art performance on an unseen validation set of a popular benchmark dataset, ALFRED.
We highlight this result as we believe it may be a wider phenomenon in machine learning tasks but primarily noticeable only in benchmarks that limit evaluations on test splits.
arXiv Detail & Related papers (2022-05-18T23:52:21Z) - Interpretable Learning-to-Rank with Generalized Additive Models [78.42800966500374]
Interpretability of learning-to-rank models is a crucial yet relatively under-examined research area.
Recent progress on interpretable ranking models largely focuses on generating post-hoc explanations for existing black-box ranking models.
We lay the groundwork for intrinsically interpretable learning-to-rank by introducing generalized additive models (GAMs) into ranking tasks.
arXiv Detail & Related papers (2020-05-06T01:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.