Related papers: Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently

URL: http://arxiv.org/abs/2511.17852v1
Date: Sat, 22 Nov 2025 00:38:43 GMT
Title: Transformers with RL or SFT Provably Learn Sparse Boolean Functions, But Differently
Authors: Bochen Lyu, Yiyang Jia, Xiaohao Cai, Zhanxing Zhu
Abstract summary: Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches for eliciting Chain-of-Thought (CoT) capabilities in transformers, yet their underlying mechanisms and differences remain theoretically unclear. We analyze the learning dynamics of fine-tuning the transformer via either RL or SFT with CoT to identify sufficient conditions for it to provably learn $k$-sparse Boolean functions. Notably, we reveal that RL and SFT exhibit distinct learning behaviors: RL learns the whole CoT chain simultaneously, whereas SFT learns the CoT chain step-by-step.
Score: 20.12397699480725
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers can acquire Chain-of-Thought (CoT) capabilities to solve complex reasoning tasks through fine-tuning. Reinforcement learning (RL) and supervised fine-tuning (SFT) are two primary approaches to this end, yet their underlying mechanisms and differences remain theoretically unclear. In this work, we examine these aspects specifically for learning $k$-sparse Boolean functions with a one-layer transformer and intermediate supervision that is akin to CoT. In particular, we consider $k$-sparse Boolean functions that can be recursively decomposed into fixed 2-sparse Boolean functions. We analyze the learning dynamics of fine-tuning the transformer via either RL or SFT with CoT to identify sufficient conditions for it to provably learn these functions. We verify that these conditions hold for three basic examples, including $k$-PARITY, $k$-AND, and $k$-OR, thus demonstrating the learnability of both approaches. Notably, we reveal that RL and SFT exhibit distinct learning behaviors: RL learns the whole CoT chain simultaneously, whereas SFT learns the CoT chain step-by-step. Overall, our findings provide theoretical insights into the underlying mechanisms of RL and SFT as well as how they differ in triggering the CoT capabilities of transformers.
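
The recursive decomposition mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's construction: the {0, 1} bit encoding, the left-to-right chaining order, and the names `GATES` and `cot_chain` are assumptions made here for illustration only. Each intermediate value plays the role of one CoT step, and the last one is the value of the $k$-sparse function.

```python
from typing import Callable, Dict, List, Sequence

# Fixed 2-sparse Boolean functions on bits in {0, 1} (encoding is an assumption).
GATES: Dict[str, Callable[[int, int], int]] = {
    "parity": lambda a, b: a ^ b,
    "and": lambda a, b: a & b,
    "or": lambda a, b: a | b,
}


def cot_chain(x: Sequence[int], support: Sequence[int], gate: str) -> List[int]:
    """Return the k-1 intermediate values obtained by folding a fixed
    2-sparse gate over the k relevant coordinates of x, left to right.

    The final entry equals the k-sparse function value (k-PARITY, k-AND
    or k-OR); the earlier entries act as CoT-style intermediate steps.
    """
    g = GATES[gate]
    acc = x[support[0]]
    chain: List[int] = []
    for idx in support[1:]:
        acc = g(acc, x[idx])  # one 2-sparse step
        chain.append(acc)
    return chain


if __name__ == "__main__":
    x = [1, 0, 1, 1, 0]   # input bits; only the support coordinates matter
    support = [0, 2, 3]   # k = 3 relevant coordinates
    for name in GATES:
        print(name, cot_chain(x, support, name))
```

For a fixed support of size $k$, the chain has $k-1$ entries, so a model supervised on it sees every intermediate 2-sparse result rather than only the final label.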

Related papers

Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems [59.94955550958074]
We study a tractable theoretical framework in which each transformer block acts as an expert governed by a continuously trained gating network. We show that expert specialization reduces gradient conflicts and makes each subtask strongly convex. We prove that the training drives the expected prediction loss to near zero in $O(\log(\epsilon^{-1}))$ steps, significantly improving over the $O(\epsilon^{-1})$ rate for a single transformer.
arXiv Detail & Related papers (2025-10-30T21:07:36Z)
The Kinetics of Reasoning: How Chain-of-Thought Shapes Learning in Transformers? [25.29458951592086]
Chain-of-thought (CoT) supervision can substantially improve transformer performance. We investigate these learning dynamics through the lens of grokking by pretraining transformers on symbolic reasoning tasks.
arXiv Detail & Related papers (2025-10-28T20:14:26Z)
Provable In-Context Learning of Nonlinear Regression with Transformers [66.99048542127768]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z)
Learning Compositional Functions with Transformers from Easy-to-Hard Data [63.96562216704653]
We study the learnability of the $k$-fold composition task, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations. We show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer.
arXiv Detail & Related papers (2025-05-29T17:22:00Z)
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking [41.3496135369579]
Chain-of-thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks. There is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. We evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT.
arXiv Detail & Related papers (2025-02-27T14:24:51Z)
From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency [17.612497960364916]
Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLMs). We demonstrate that CoT can substantially improve sample efficiency even when representation power is sufficient. We show that CoT simplifies the learning process by introducing sparse dependencies among input tokens, and leads to a sparse and interpretable attention.
arXiv Detail & Related papers (2024-10-07T19:45:09Z)
Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds. We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
arXiv Detail & Related papers (2024-10-03T21:21:02Z)
Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527835]
Chain-of-Thought (CoT) is an efficient method that enables the reasoning ability of large language models by augmenting the query using examples with multiple intermediate steps. We show that in-context learning without intermediate steps may fail to provide an accurate generalization in settings where CoT succeeds.
arXiv Detail & Related papers (2024-10-03T03:12:51Z)
Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models [12.656574142412484]
We make an attempt to understand the correlation between supervised fine-tuning and reinforcement learning. We find that both atomic and synthetic functions are indispensable for SFT's generalization.
arXiv Detail & Related papers (2024-06-14T03:39:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.