Revisiting Generalization Across Difficulty Levels: It's Not So Easy
- URL: http://arxiv.org/abs/2511.21692v1
- Date: Wed, 26 Nov 2025 18:59:57 GMT
- Title: Revisiting Generalization Across Difficulty Levels: It's Not So Easy
- Authors: Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach
- Abstract summary: We investigate how well large language models generalize across different task difficulties. We show that training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties.
- Score: 11.203451380580868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs' generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.
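The difficulty rankings described above come from fitting an IRT model to the correctness patterns of many LLMs. As a rough, hypothetical illustration of that pipeline (the abstract does not specify which IRT variant or fitting procedure the authors use), the sketch below fits a one-parameter logistic (Rasch) model to a binary model-by-example response matrix and ranks examples by their estimated difficulty. All names here (fit_rasch, responses, and so on) are placeholders, not the paper's code.

```python
# Hypothetical sketch: Rasch (1PL IRT) difficulty estimation from LLM outputs.
# Under this model, P(model j solves example i) = sigmoid(theta_j - b_i),
# so a larger b_i means a harder example.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rasch(responses, n_steps=2000, lr=0.05):
    """responses: (n_models, n_items) matrix of 0/1 correctness."""
    n_models, n_items = responses.shape
    ability = np.zeros(n_models)        # theta_j, one per LLM
    difficulty = np.zeros(n_items)      # b_i, one per example
    for _ in range(n_steps):
        p = sigmoid(ability[:, None] - difficulty[None, :])
        err = responses - p             # gradient of the Bernoulli log-likelihood
        ability += lr * err.mean(axis=1)
        difficulty -= lr * err.mean(axis=0)
        difficulty -= difficulty.mean() # pin the scale for identifiability
    return ability, difficulty

# Toy usage: 5 "models" answering 8 "examples".
rng = np.random.default_rng(0)
toy = (rng.random((5, 8)) > 0.4).astype(float)
theta, b = fit_rasch(toy)
print(np.argsort(-b))  # example indices from hardest to easiest
```

Grouping examples by estimated difficulty in this way would yield the kind of fine-grained difficulty levels the abstract refers to.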
Related papers
- LLMs Encode How Difficult Problems Are [4.990590622073335]
We investigate whether large language models encode problem difficulty in a way that aligns with human judgment. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench (a minimal probe sketch is given after this list).
arXiv Detail & Related papers (2025-10-20T22:48:23Z)
- Probing the Difficulty Perception Mechanism of Large Language Models [31.945071671041465]
We investigate whether large language models implicitly encode problem difficulty in their internal representations. We locate this signal in specific attention heads of the final Transformer layer. Experiments provide practical support for using LLMs as automatic difficulty annotators.
arXiv Detail & Related papers (2025-10-07T14:24:32Z)
- Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
arXiv Detail & Related papers (2025-09-08T17:36:21Z)
- Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset. We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard). We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (on the order of 1K instances). Extremely Hard questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z)
- DAST: Difficulty-Aware Self-Training on Large Language Models [68.30467836807362]
Large language model (LLM) self-training methods consistently under-sample challenging queries. This work proposes a difficulty-aware self-training framework that focuses on improving the quantity and quality of self-generated responses.
arXiv Detail & Related papers (2025-03-12T03:36:45Z)
- Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [126.27645170941268]
We present Easy2Hard-Bench, a collection of six benchmark datasets spanning various domains. Each problem within these datasets is annotated with a numerical difficulty score. We provide a comprehensive analysis of LLMs' performance and generalization capabilities across varying levels of difficulty.
arXiv Detail & Related papers (2024-09-27T03:49:56Z)
- Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? [65.43882564649721]
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues.
We develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty.
We analyze the potential for improvement in consistency using a relative consistency score.
arXiv Detail & Related papers (2024-06-18T17:25:47Z)
- The Unreasonable Effectiveness of Easy Training Data for Hard Tasks [84.30018805150607]
We present the surprising conclusion that current pretrained language models often generalize relatively well from easy to hard data.
We demonstrate this kind of easy-to-hard generalization using simple training methods like in-context learning, linear heads, and QLoRA.
We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied.
arXiv Detail & Related papers (2024-01-12T18:36:29Z)
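Several of the related papers above probe whether difficulty is linearly decodable from a model's hidden states. The sketch below, referenced under "LLMs Encode How Difficult Problems Are", is a minimal, hypothetical version of such a linear probe; the activations, labels, and layer choice are simulated placeholders rather than anything released with those papers.

```python
# Hypothetical sketch: a linear probe predicting example difficulty from a
# frozen LLM's hidden states at one layer/token position.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 256))   # stand-in for extracted activations
difficulty = rng.normal(size=500)             # stand-in for IRT/benchmark difficulty labels

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, difficulty, test_size=0.2, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)      # the probe is just a regularized linear map
print("held-out R^2:", probe.score(X_te, y_te))
```

On real activations, a high held-out R^2 would indicate that difficulty is encoded roughly linearly at that layer, which is the kind of evidence those probing papers report.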