Four Quadrants of Difficulty: A Simple Categorisation and its Limits
- URL: http://arxiv.org/abs/2601.01488v1
- Date: Sun, 04 Jan 2026 11:31:51 GMT
- Title: Four Quadrants of Difficulty: A Simple Categorisation and its Limits
- Authors: Vanessa Toborek, Sebastian Müller, Christian Bauckhage,
- Abstract summary: We propose a four-quadrant categorisation of difficulty signals -- human vs. model and task-agnostic vs. task-dependent. We find that task-agnostic features behave largely independently and that only task-dependent features align. These findings challenge common Curriculum Learning intuitions and highlight the need for lightweight, task-dependent difficulty estimators.
- Score: 4.304007567113229
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Curriculum Learning (CL) aims to improve the outcome of model training by estimating the difficulty of samples and scheduling them accordingly. In NLP, difficulty is commonly approximated using task-agnostic linguistic heuristics or human intuition, implicitly assuming that these signals correlate with what neural models find difficult to learn. We propose a four-quadrant categorisation of difficulty signals -- human vs. model and task-agnostic vs. task-dependent -- and systematically analyse their interactions on a natural language understanding dataset. We find that task-agnostic features behave largely independently and that only task-dependent features align. These findings challenge common CL intuitions and highlight the need for lightweight, task-dependent difficulty estimators that better reflect model learning behaviour.
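The core analysis reduces to measuring rank agreement between difficulty signals from different quadrants. The sketch below is a minimal illustration of that idea, not the paper's pipeline: it assumes a toy corpus, token count as the human/task-agnostic signal, and a simulated per-example loss standing in for the model/task-dependent signal.

```python
# Minimal sketch: rank agreement between two difficulty quadrants.
# Assumptions: a toy corpus and a simulated "per-example loss"; in the
# paper's setting the loss would come from a model trained on the task.
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus standing in for an NLU training set.
texts = [f"training sentence number {i} " + "word " * (i % 17)
         for i in range(200)]

# Human / task-agnostic quadrant: a linguistic heuristic (token count).
agnostic = np.array([len(t.split()) for t in texts], dtype=float)

# Model / task-dependent quadrant: per-example training loss (simulated).
dependent = rng.normal(size=len(texts))

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman correlation computed as Pearson correlation of ranks."""
    rank = lambda x: x.argsort().argsort().astype(float)
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

# A near-zero value mirrors the paper's finding that task-agnostic
# signals need not track what the model actually finds hard.
print(f"Spearman(agnostic, dependent) = {spearman(agnostic, dependent):+.3f}")
```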
Related papers
- LLMs Encode How Difficult Problems Are [4.990590622073335]
We investigate whether large language models encode problem difficulty in a way that aligns with human judgment. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench (see the probe sketch below).
arXiv Detail & Related papers (2025-10-20T22:48:23Z)
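As a hedged illustration of the probing recipe above, the sketch below fits a closed-form ridge probe from hidden states to difficulty ratings; the activations and ratings are synthetic stand-ins, and the layer/position sweep over 60 models is omitted.

```python
# Minimal linear-probe sketch: map frozen hidden states to a difficulty
# score. The activations and ratings are synthetic stand-ins; in practice
# they would come from a chosen layer/token position of the probed model.
import numpy as np

rng = np.random.default_rng(1)
n_items, d_model = 500, 64

X = rng.normal(size=(n_items, d_model))           # frozen activations
w_true = rng.normal(size=d_model)
y = X @ w_true + 0.5 * rng.normal(size=n_items)   # difficulty ratings

# Closed-form ridge probe, fit on a train split:
# w = (X^T X + lam * I)^{-1} X^T y
lam, tr, te = 1.0, slice(0, 400), slice(400, None)
w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(d_model), X[tr].T @ y[tr])

# Held-out correlation indicates whether difficulty is linearly decodable.
r = np.corrcoef(X[te] @ w, y[te])[0, 1]
print(f"held-out correlation: {r:+.3f}")
```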
- CHUCKLE -- When Humans Teach AI To Learn Emotions The Easy Way [11.645594774207511]
We propose CHUCKLE (Crowdsourced Human Understanding Curriculum for Knowledge Led Emotion Recognition), a perception-driven CL framework for emotion recognition. We show that CHUCKLE increases the relative mean accuracy by 6.56% for LSTMs and 1.61% for Transformers over non-curriculum baselines, while reducing the number of gradient updates.
arXiv Detail & Related papers (2025-10-10T13:38:06Z)
- Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding [53.63482987410292]
We present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks (see the scheduling sketch below).
arXiv Detail & Related papers (2025-07-13T19:36:17Z)
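A minimal sketch of the scheduling idea above, under assumptions: `score_fn` stands in for a difficulty score read off a pretrained model (e.g., its per-example loss or confidence), and training simply consumes examples in ascending-difficulty order.

```python
# Minimal curriculum-ordering sketch: rank fine-tuning examples by a
# difficulty score from a pretrained model, then schedule easy-to-hard.
# `score_fn` is a stand-in; in the paper the score comes from the
# pretrained language model itself.
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, label)

def curriculum_order(data: List[Example],
                     score_fn: Callable[[Example], float]) -> List[Example]:
    """Return examples sorted from lowest (easiest) to highest difficulty."""
    return sorted(data, key=score_fn)

# Stand-in scorer: longer inputs count as harder (illustrative only).
toy_data = [("a b c", 0), ("a", 1), ("a b", 0)]
for text, label in curriculum_order(toy_data, lambda ex: len(ex[0].split())):
    print(text, label)  # feed into fine-tuning in this order
```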
- Climbing the Ladder of Reasoning: What LLMs Can -- and Still Can't -- Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset. We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard). We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (around 1K instances). Extremely Hard questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z)
- DAST: Difficulty-Aware Self-Training on Large Language Models [68.30467836807362]
Large Language Model (LLM) self-training methods consistently under-sample challenging queries. This work proposes a difficulty-aware self-training framework that focuses on improving the quantity and quality of self-generated responses (see the sampling sketch below).
arXiv Detail & Related papers (2025-03-12T03:36:45Z)
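A hedged sketch of difficulty-aware sampling: allocate a larger generation budget to harder queries rather than sampling uniformly. The scores and the linear budget rule are illustrative assumptions, not DAST's exact allocation scheme.

```python
# Minimal difficulty-aware sampling sketch: spend more self-training
# generations on harder queries instead of sampling uniformly.
import math

# Illustrative difficulty scores in [0, 1]; DAST derives these from the
# model's own behaviour on each query.
queries = {"easy question": 0.1, "medium question": 0.5, "hard question": 0.9}

def generation_budget(difficulty: float, base: int = 4, scale: int = 12) -> int:
    """More samples for harder queries; at least `base` for every query."""
    return base + math.ceil(scale * difficulty)

for q, d in queries.items():
    print(f"{q}: sample {generation_budget(d)} candidate responses")
```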
- Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty [2.335292678914151]
This study investigates the effectiveness of Large Language Models (LLMs) in estimating the difficulty of reading comprehension questions. We use OpenAI's GPT-4o and o1 to estimate question difficulty on the Study Aid and Reading Assessment (SARA) dataset. The results indicate that while the models yield difficulty estimates that align meaningfully with derived IRT parameters, there are notable differences in their sensitivity to extreme item characteristics (see the 2PL sketch below).
arXiv Detail & Related papers (2025-02-25T02:28:48Z)
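For reference, the sketch below shows the two-parameter logistic (2PL) item response function that such IRT difficulty parameters come from; the parameter values are illustrative, not taken from SARA.

```python
# Minimal 2PL item-response sketch: the IRT model whose difficulty
# parameters LLM estimates are compared against. Values are illustrative.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL model: P(correct | ability theta, discrimination a, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A harder item (larger b) is answered correctly less often at fixed ability.
for b in (-1.0, 0.0, 1.5):
    print(f"difficulty b={b:+.1f}: P(correct | theta=0) = "
          f"{p_correct(0.0, a=1.2, b=b):.2f}")
```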
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- Generalization of Neural Combinatorial Solvers Through the Lens of Adversarial Robustness [68.97830259849086]
Most datasets only capture a simpler subproblem and likely suffer from spurious features.
We study adversarial robustness - a local generalization property - to reveal hard, model-specific instances and spurious features.
Unlike in other applications, where perturbation models are designed around subjective notions of imperceptibility, our perturbation models are efficient and sound.
Surprisingly, with such perturbations, a sufficiently expressive neural solver does not suffer from the limitations of the accuracy-robustness trade-off common in supervised learning.
arXiv Detail & Related papers (2021-10-21T07:28:11Z)
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [57.14468881854616]
We propose an auxiliary training objective that improves the generalization capabilities of neural networks.
We use pairs of minimally-different examples with different labels, a.k.a. counterfactual or contrasting examples, which provide a signal indicative of the underlying causal structure of the task (see the sketch below).
Models trained with this technique demonstrate improved performance on out-of-distribution test sets.
arXiv Detail & Related papers (2020-04-20T02:47:49Z)
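A minimal sketch of gradient supervision on a counterfactual pair, under simplifying assumptions (a linear model and a cosine-alignment penalty, which are illustrative choices rather than the paper's exact objective): the input gradient at an example is encouraged to point toward its minimally-different counterfactual.

```python
# Hedged sketch: auxiliary gradient supervision from a counterfactual pair.
# The tiny linear model and cosine penalty are simplifying assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(8, 2)

x = torch.randn(8, requires_grad=True)      # an example
x_cf = x.detach() + 0.1 * torch.randn(8)    # its counterfactual (label flips)
y, y_cf = torch.tensor(0), torch.tensor(1)

task_loss = (F.cross_entropy(model(x).unsqueeze(0), y.unsqueeze(0))
             + F.cross_entropy(model(x_cf).unsqueeze(0), y_cf.unsqueeze(0)))

# Input gradient at x, kept in the graph so the penalty is trainable.
(g,) = torch.autograd.grad(task_loss, x, create_graph=True)

# Auxiliary objective: the gradient should align with (x_cf - x), i.e.
# moving toward the counterfactual should flip the prediction.
align = 1.0 - F.cosine_similarity(g, x_cf - x.detach(), dim=0)
(task_loss + 0.5 * align).backward()        # joint update signal
print(f"task loss {task_loss.item():.3f}, alignment penalty {align.item():.3f}")
```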