Test-time Adaptation of Tiny Recursive Models
- URL: http://arxiv.org/abs/2511.02886v1
- Date: Tue, 04 Nov 2025 13:47:45 GMT
- Title: Test-time Adaptation of Tiny Recursive Models
- Authors: Ronan Killian McGovern
- Abstract summary: This paper shows that one can efficiently fine-tune on competition tasks within the allowed compute limits. Specifically, a model was pre-trained on 1,280 public tasks for 700k+ optimizer steps over 48 hours on 4xH100 SXM GPUs to obtain a ~10% score on the public evaluation set. That model was then post-trained in just 12,500 gradient steps during the competition to reach a score of 6.67% on semi-private evaluation tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prior to the close of the 2025 ARC Prize competition, the leading open source approach - known as TRM, or Tiny Recursive Models - involved training a 7M parameter recursive neural network on augmented variants of ARC tasks. That approach scored approximately 7.8% on the public ARC AGI II evaluation set, but required a level of compute far in excess of what is allowed during the competition. This paper shows that, by starting from a tiny recursive model that has been pre-trained on public ARC tasks, one can efficiently fine-tune on competition tasks within the allowed compute limits. Specifically, a model was pre-trained on 1,280 public tasks for 700k+ optimizer steps over 48 hours on 4xH100 SXM GPUs to obtain a ~10% score on the public evaluation set. That model was then post-trained in just 12,500 gradient steps during the competition to reach a score of 6.67% on semi-private evaluation tasks. Notably, such post-training performance is achieved by full fine-tuning of the tiny model, not by LoRA fine-tuning or fine-tuning of task embeddings alone.
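To make the recipe concrete, the sketch below shows what such a post-training loop could look like. It is a minimal illustration, assuming hypothetical helpers `load_pretrained_trm` and `augmented_competition_batches` and illustrative hyperparameters; it is not the paper's actual code.

```python
import torch
import torch.nn.functional as F

# ~7M-parameter recursive model, pre-trained on the 1,280 public ARC tasks.
model = load_pretrained_trm("trm_public_arc.pt")  # hypothetical loader

# Full fine-tuning: every parameter stays trainable (no LoRA, no embedding-only tuning).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is illustrative

model.train()
for step, (grids, targets) in enumerate(augmented_competition_batches()):  # hypothetical
    logits = model(grids)                            # per-cell logits over grid colors
    loss = F.cross_entropy(logits.flatten(0, -2), targets.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step + 1 >= 12_500:                           # the step budget from the abstract
        break
```

The `model.parameters()` call reflects the abstract's key point: all weights are updated during post-training, rather than LoRA adapters or task embeddings alone.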
Related papers
- Accelerating Training Speed of Tiny Recursive Models via Curriculum Guided Adaptive Recursion [3.806023028063132]
CGAR is a novel training methodology that applies curriculum learning to architectural depth rather than traditional data ordering. On Sudoku-Extreme with 423,168 test puzzles, CGAR achieves a 1.71x training speedup with only a 0.63% accuracy drop. CGAR-trained models exhibit superior inference efficiency with 100% halting accuracy and 11% fewer reasoning steps.
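As a rough illustration of applying a curriculum to recursion depth rather than to data, one could grow the number of recursive steps over training; the linear schedule below is an assumption, not the paper's actual curriculum.

```python
def recursion_depth(step: int, total_steps: int,
                    min_depth: int = 2, max_depth: int = 16) -> int:
    """Linearly grow the number of recursive reasoning steps over training."""
    frac = min(step / total_steps, 1.0)
    return min_depth + round(frac * (max_depth - min_depth))

# In the training loop, the model would be unrolled recursion_depth(step, total_steps)
# times, so early steps are cheap (shallow recursion) and later steps use full depth.
```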
arXiv Detail & Related papers (2025-11-11T08:17:23Z)
- Exploring the Hierarchical Reasoning Model for Small Natural-Image Classification Without Augmentation [51.56484100374058]
It is evaluated on MNIST, CIFAR-10, and CIFAR-100 under a deliberately raw regime. It is concluded that, for small-resolution image classification without augmentation, HRM is not competitive with even simple convolutional architectures.
arXiv Detail & Related papers (2025-10-04T01:22:41Z)
- LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z)
- ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs [75.72672339168092]
We introduce ReasonFlux-PRM, a novel trajectory-aware PRM for evaluating trajectory-response reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. Our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling.
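As a loose illustration of mixing the two supervision levels, one could blend an averaged step-level reward with a trajectory-level reward; the weighting `alpha` below is an assumed hyperparameter, not taken from the paper.

```python
def combined_reward(step_scores: list[float], trajectory_score: float,
                    alpha: float = 0.5) -> float:
    """Blend per-step rewards with a whole-trajectory reward (illustrative)."""
    step_term = sum(step_scores) / len(step_scores)  # average step-level reward
    return alpha * step_term + (1 - alpha) * trajectory_score
```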
arXiv Detail & Related papers (2025-06-23T17:59:02Z)
- Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models [43.98994504606355]
We propose Reinforcement Learning via Self-Confidence (RLSC) for large language models (LLMs). RLSC uses the model's own confidence as its reward signal, eliminating the need for labels, preference models, or reward engineering.
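A hedged sketch of such a confidence reward under a standard autoregressive interface is shown below; shapes and names are assumptions, not the authors' code.

```python
import torch

def self_confidence_reward(logits: torch.Tensor, sampled_ids: torch.Tensor) -> torch.Tensor:
    """Mean log-probability the model assigns to its own sampled tokens.

    logits: (batch, seq_len, vocab); sampled_ids: (batch, seq_len).
    """
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean(dim=-1)  # higher when the model is more confident
```

A policy-gradient update would then reinforce sampled completions in proportion to this reward; no labels or external reward model enter the loop.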
arXiv Detail & Related papers (2025-06-05T19:55:15Z)
- Reinforcement Learning for Reasoning in Large Language Models with One Training Example [117.86853102104256]
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). We identify several interesting phenomena during 1-shot RLVR, including cross-category generalization, increased frequency of self-reflection, and sustained test performance improvement.
arXiv Detail & Related papers (2025-04-29T09:24:30Z)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs [84.95584393629998]
We report on the training practice of Kimi k1.5, our latest multi-modal language model trained with reinforcement learning. Long-context scaling and improved policy optimization methods are key ingredients of our approach. Our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities.
arXiv Detail & Related papers (2025-01-22T02:48:14Z)
- H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark [7.840781070208872]
Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods.
Previous work explored how well humans can solve tasks from the ARC benchmark.
We obtain a more robust estimate of human performance by evaluating 1,729 humans on the full set of 400 training and 400 evaluation tasks.
arXiv Detail & Related papers (2024-09-02T17:11:32Z)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
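For illustration of the step-wise objective just described, the DPO loss evaluated at step granularity might look like the sketch below, where each argument is a summed log-probability of one reasoning step under the policy or reference model; names and `beta` are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def step_dpo_loss(pi_chosen: torch.Tensor, pi_rejected: torch.Tensor,
                  ref_chosen: torch.Tensor, ref_rejected: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective applied to per-step log-probs instead of whole answers."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(margin).mean()
```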
arXiv Detail & Related papers (2024-06-26T17:43:06Z)
- Biased Self-supervised learning for ASR [31.701098864180256]
This paper proposes a method to bias self-supervised learning towards a specific task.
The core idea is to slightly fine-tune the model that is used to obtain the target sequence.
For the streaming models, the pre-training approach yields a reduction in word error rate of 44.1%.
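A rough sketch of that biasing step is shown below; every helper name is hypothetical, and the MSE objective stands in for whatever self-supervised loss the paper actually uses.

```python
import torch
import torch.nn.functional as F

# Bias the target model first: a brief, task-specific fine-tune.
teacher = load_target_model("ssl_teacher.pt")       # model that produces SSL targets
finetune(teacher, task_specific_data, steps=1_000)  # the "slight" fine-tune

student = make_student_model()
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-4)

for audio in unlabeled_audio_batches():
    with torch.no_grad():
        targets = teacher(audio)                     # targets now biased toward the task
    loss = F.mse_loss(student(audio), targets)       # stand-in self-supervised objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```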
arXiv Detail & Related papers (2022-11-04T15:57:59Z)
- The ReturnZero System for VoxCeleb Speaker Recognition Challenge 2022 [0.0]
We describe the top-scoring submissions from team RTZR for the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
The top-performing system is a fusion of 7 models spanning 3 different model architectures.
The final submission achieves 0.165 DCF and 2.912% EER on the VoxSRC22 test set.
arXiv Detail & Related papers (2022-09-21T06:54:24Z)