Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning
- URL: http://arxiv.org/abs/2510.04786v1
- Date: Mon, 06 Oct 2025 13:07:14 GMT
- Title: Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning
- Authors: Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt
- Abstract summary: We propose an agent that assembles a task-specific curriculum and applies reinforcement learning to continue training the model for its target task. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks.
- Score: 51.34470146576741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans are good at learning on the job: We learn how to solve the tasks we face as we go along. Can a model do the same? We propose an agent that assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and applies reinforcement learning to continue training the model for its target task. The test-time curriculum avoids time-consuming human curation of datasets by automatically selecting the most task-relevant data from a large pool of available training data. Our experiments demonstrate that reinforcement learning on a test-time curriculum consistently improves the model on its target tasks, across a variety of evaluations and models. Notably, on challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that TTC-RL significantly raises the performance ceiling compared to the initial model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to 43%. Our findings show the potential of test-time curricula in extending the test-time scaling paradigm to continual training on thousands of task-relevant experiences during test-time.
Related papers
- Understanding the Role of Training Data in Test-Time Scaling [56.12341509545198]
We study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. We show that training on a diverse, relevant, and hard set of tasks results in best performance for test-time scaling.
arXiv Detail & Related papers (2025-10-04T01:38:48Z) - The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback [51.144727949988436]
Reinforcement learning (RL) has demonstrated potential to enhance the reasoning capabilities of large language models (LLMs). In this work, we explore improving LLMs through RL with minimal data. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness.
arXiv Detail & Related papers (2025-10-03T06:32:10Z) - Test-time Offline Reinforcement Learning on Goal-related Experience [50.94457794664909]
Research in foundation models has shown that performance can be substantially improved through test-time training. We propose a novel self-supervised data selection criterion, which selects transitions from an offline dataset according to their relevance to the current state. Our goal-conditioned test-time training (GC-TTT) algorithm applies this routine in a receding-horizon fashion during evaluation, adapting the policy to the current trajectory as it is being rolled out.
arXiv Detail & Related papers (2025-07-24T21:11:39Z) - LearnAlign: Reasoning Data Selection for Reinforcement Learning in Large Language Models Based on Improved Gradient Alignment [14.655048266761783]
Reinforcement learning (RL) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. We present LearnAlign, which intelligently selects the learnable and representative training reasoning data for RL post-training. Experiments across three mathematical reasoning benchmarks demonstrate that our method significantly reduces training data requirements.
arXiv Detail & Related papers (2025-06-13T06:05:58Z) - AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z) - The Surprising Effectiveness of Test-Time Training for Few-Shot Learning [59.309477460893916]
Language models (LMs) have shown impressive performance on tasks within their training distribution, but often struggle with structurally novel tasks. We investigate the effectiveness of test-time training (TTT) as a mechanism for improving LMs' reasoning and few-shot learning capabilities. Our findings highlight the limitations of in-context learning for novel tasks and demonstrate the potential of test-time training to enhance language model adaptability.
arXiv Detail & Related papers (2024-11-11T18:59:45Z) - Training on the Test Task Confounds Evaluation and Emergence [16.32378359459614]
We show that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We propose an effective method to adjust for the effect of training on the test task on benchmark evaluations.
arXiv Detail & Related papers (2024-07-10T17:57:58Z) - USB: A Unified Summarization Benchmark Across Tasks and Domains [68.82726887802856]
We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports 8 interrelated tasks.
We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models.
arXiv Detail & Related papers (2023-05-23T17:39:54Z) - Training Dynamics for Curriculum Learning: A Study on Monolingual and Cross-lingual NLU [19.42920238320109]
Curriculum Learning (CL) is a technique of training models via ranking examples in a typically increasing difficulty trend.
In this work, we employ CL for Natural Language Understanding (NLU) tasks by taking advantage of training dynamics as difficulty metrics.
Experiments indicate that training dynamics can lead to better performing models with smoother training compared to other difficulty metrics.
arXiv Detail & Related papers (2022-10-22T17:10:04Z) - Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z) - Meta-Reinforcement Learning for Heuristic Planning [12.462608802359936]
In Meta-Reinforcement Learning (meta-RL) an agent is trained on a set of tasks to prepare for and learn faster in new, unseen, but related tasks.
We show that given a set of training tasks, learning can be both faster and more effective if the training tasks are appropriately selected.
We propose a task selection algorithm, Information-Theoretic Task Selection (ITTS), based on information theory.
arXiv Detail & Related papers (2021-07-06T13:25:52Z) - Information-theoretic Task Selection for Meta-Reinforcement Learning [9.69596041242667]
In Meta-Reinforcement Learning (meta-RL) an agent is trained on a set of tasks to prepare for and learn faster in new, unseen, but related tasks.
We show that given a set of training tasks, learning can be both faster and more effective if the training tasks are appropriately selected.
We propose a task selection algorithm, Information-Theoretic Task Selection (ITTS), based on information theory.
arXiv Detail & Related papers (2020-11-02T15:37:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.