How Does RL Post-training Induce Skill Composition? A Case Study on Countdown
- URL: http://arxiv.org/abs/2512.01775v1
- Date: Mon, 01 Dec 2025 15:17:16 GMT
- Title: How Does RL Post-training Induce Skill Composition? A Case Study on Countdown
- Authors: Simon Park, Simran Kaur, Sanjeev Arora
- Abstract summary: We study what reinforcement learning teaches about skill composition and how the structure of the composition affects skill transfer. Tracking tree shapes and their success rates over training, we find out-of-distribution generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks. Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
- Score: 27.950240848542645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To this end, we study what RL post-training teaches about skill composition and how the structure of the composition affects the skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and thus can be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability -- models master shallow balanced trees (workload is balanced between subtasks) before deep unbalanced ones, with persistent fragility on right-heavy structures (even when the composition depth is the same as some left-heavy structures). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
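To make the expression-tree view in the abstract concrete, here is a minimal Python sketch that checks a Countdown solution and reads off the shape and depth of its tree. It is illustrative only and is not the authors' code: the helper names (`parse`, `evaluate`, `leaves`, `shape`, `depth`, `is_valid`), the operator set (+, -, *, /), and the rule that every given number is used exactly once are assumptions, since the abstract does not specify them.

```python
import ast
from fractions import Fraction

# Assumed operator set; the abstract does not list the allowed operations.
OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}

def _is_leaf(node):
    # A leaf is a plain integer literal (booleans are excluded).
    return (isinstance(node, ast.Constant)
            and isinstance(node.value, int)
            and not isinstance(node.value, bool))

def _is_binop(node):
    return isinstance(node, ast.BinOp) and type(node.op) in OPS

def parse(expr):
    """Parse an arithmetic string into a Python AST expression node."""
    return ast.parse(expr, mode="eval").body

def evaluate(node):
    """Evaluate the tree exactly, using rationals to avoid float error."""
    if _is_leaf(node):
        return Fraction(node.value)
    if _is_binop(node):
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    raise ValueError("unsupported expression")

def leaves(node):
    """Return the numbers at the leaves, left to right."""
    if _is_leaf(node):
        return [node.value]
    if _is_binop(node):
        return leaves(node.left) + leaves(node.right)
    raise ValueError("unsupported expression")

def shape(node):
    """Tree shape with numbers and operators erased: a leaf is 'x'."""
    if _is_leaf(node):
        return "x"
    return (shape(node.left), shape(node.right))

def depth(node):
    """Composition depth: 0 for a leaf, 1 + deepest child otherwise."""
    if _is_leaf(node):
        return 0
    return 1 + max(depth(node.left), depth(node.right))

def is_valid(expr, numbers, target):
    """Assumed rule: use each given number exactly once and hit the target."""
    try:
        tree = parse(expr)
        if sorted(leaves(tree)) != sorted(numbers):
            return False
        return evaluate(tree) == target
    except (ValueError, ZeroDivisionError, SyntaxError):
        return False

balanced = parse("(3 + 5) * (7 - 2)")      # shallow, balanced tree
right_heavy = parse("3 + (5 * (7 - 2))")   # deeper, right-heavy tree
print(shape(balanced), depth(balanced))
print(shape(right_heavy), depth(right_heavy))
print(is_valid("(3 + 5) * (7 - 2)", [3, 5, 7, 2], 40))
```

In this sketch, two solutions over the same four numbers can differ in shape: the balanced tree has depth 2, while the right-heavy one has depth 3. This is the kind of structural difference the abstract tracks over training.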
Related papers
- TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG [71.06073770344732]
Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining outcome-only rewards.
arXiv Detail & Related papers (2026-01-11T14:07:30Z)
- From Atomic to Composite: Reinforcement Learning Enables Generalization in Complementary Reasoning [83.94543243783285]
We study Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. We find that RL acts as a reasoning synthesizer rather than a probability amplifier.
arXiv Detail & Related papers (2025-12-01T18:27:25Z)
- Tree Search for LLM Agent Reinforcement Learning [23.7084695563981]
Tree-based Group Relative Policy Optimization (Tree-GRPO) is a grouped agent RL method based on tree search. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable. We demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning.
arXiv Detail & Related papers (2025-09-25T14:37:09Z)
- Unveiling the Mechanisms of Explicit CoT Training: How CoT Enhances Reasoning Generalization [9.191236388401226]
The integration of explicit Chain-of-Thought (CoT) reasoning into training large language models has advanced their reasoning capabilities, yet the mechanisms by which CoT enhances generalization remain poorly understood. This work investigates (1) how CoT training reshapes internal model representations and (2) why it improves both in-distribution (ID) and out-of-distribution (OOD) reasoning generalization.
arXiv Detail & Related papers (2025-02-07T05:21:13Z)
- When does compositional structure yield compositional generalization? A kernel theory [0.0]
We present a theory of compositional generalization in kernel models with fixed, compositionally structured representations. We identify novel failure modes in compositional generalization that arise from biases in the training data. This work examines how statistical structure in the training data can affect compositional generalization.
arXiv Detail & Related papers (2024-05-26T00:50:11Z)
- A Theory for Emergence of Complex Skills in Language Models [56.947273387302616]
A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up.
This paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework.
arXiv Detail & Related papers (2023-07-29T09:22:54Z)
- How Do In-Context Examples Affect Compositional Generalization? [86.57079616209474]
In this paper, we present CoFe, a test suite to investigate in-context compositional generalization.
We find that the compositional generalization performance can be easily affected by the selection of in-context examples.
Our systematic experiments indicate that in-context examples should be structurally similar to the test case, diverse from each other, and individually simple.
arXiv Detail & Related papers (2023-05-08T16:32:18Z)
- RLET: A Reinforcement Learning Based Approach for Explainable QA with Entailment Trees [47.745218107037786]
We propose RLET, a Reinforcement Learning based Entailment Tree generation framework.
RLET iteratively performs single step reasoning with sentence selection and deduction generation modules.
Experiments on three settings of the EntailmentBank dataset demonstrate the strength of using an RL framework.
arXiv Detail & Related papers (2022-10-31T06:45:05Z)
- Compositional Generalization Requires Compositional Parsers [69.77216620997305]
We compare sequence-to-sequence models and models guided by compositional principles on the recent COGS corpus.
We show structural generalization is a key measure of compositional generalization and requires models that are aware of complex structure.
arXiv Detail & Related papers (2022-02-24T07:36:35Z)
- Interpretable Preference-based Reinforcement Learning with Tree-Structured Reward Functions [2.741266294612776]
We propose an online, active preference learning algorithm that constructs reward functions with the intrinsically interpretable, compositional structure of a tree.
We demonstrate sample-efficient learning of tree-structured reward functions in several environments, then harness the enhanced interpretability to explore and debug for alignment.
arXiv Detail & Related papers (2021-12-20T09:53:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.