Demystifying Design Choices of Reinforcement Fine-tuning: A Batched Contextual Bandit Learning Perspective
- URL: http://arxiv.org/abs/2601.22532v1
- Date: Fri, 30 Jan 2026 04:09:06 GMT
- Title: Demystifying Design Choices of Reinforcement Fine-tuning: A Batched Contextual Bandit Learning Perspective
- Authors: Hong Xie, Xiao Hu, Tao Tan, Haoran Gu, Xin Li, Jianyu Han, Defu Lian, Enhong Chen
- Abstract summary: This paper aims to shed light on the role of design choices in learning and generalization dynamics.
The underlying challenge is that design choices are entangled, making their contributions to learning and generalization difficult to attribute.
Experiments on three base models and two datasets reveal new understanding of the role of various design choices in learning and generalization dynamics.
- Score: 83.75710105509076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reinforcement fine-tuning area is undergoing an explosion of papers, largely on optimizing design choices. Though performance gains are often claimed, inconsistent conclusions also arise from time to time, making the progress illusory. Reflecting on this illusion, we still lack principled answers to two fundamental questions: 1) what is the role of each design choice? 2) which ones are critical? This paper aims to shed light on them. The underlying challenge is that design choices are entangled, making their contributions to learning and generalization difficult to attribute. To address this challenge, we first construct a minimalist baseline for disentangling factors: one rollout per query in each round, the outcome reward serving as the training signal without any advantage trick, and a batch size of thirty-two. This baseline connects to batched contextual bandit learning, which facilitates experimental analysis. Centering around this baseline, we design an experiment pipeline that examines the marginal gains of factors such as the advantage term and the number of rollouts. Experiments on three base models and two datasets not only reveal new understanding of the role of various design choices in learning and generalization dynamics, but also identify critical ones that deserve more effort.
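To make the baseline concrete, here is a minimal sketch of the batched contextual bandit loop the abstract describes: one rollout per query per round, the raw binary outcome reward as the training signal (no advantage trick), and a batch size of 32. The synthetic task, the softmax policy, and all variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Minimal sketch of the minimalist baseline as a batched contextual bandit:
# one rollout (action) per query (context) per round, the raw outcome reward
# as the training signal (no advantage/baseline), batch size 32.
# The synthetic task is illustrative only.

rng = np.random.default_rng(0)
n_contexts, n_actions, batch_size, lr = 8, 4, 32, 0.5
theta = np.zeros((n_contexts, n_actions))           # per-context action logits
optimal = rng.integers(n_actions, size=n_contexts)  # hidden correct action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):                                # training rounds
    queries = rng.integers(n_contexts, size=batch_size)  # batch of queries
    for q in queries:
        probs = softmax(theta[q])
        a = rng.choice(n_actions, p=probs)          # single rollout per query
        r = float(a == optimal[q])                  # binary outcome reward
        grad = -probs                               # d log pi(a|q) / d theta[q]
        grad[a] += 1.0
        theta[q] += lr * r * grad                   # REINFORCE with raw reward

greedy = theta.argmax(axis=1)
print("greedy accuracy after training:", np.mean(greedy == optimal))
```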
Related papers
- Rethinking Reinforcement fine-tuning of LLMs: A Multi-armed Bandit Learning Perspective [54.209612511049734]
Two fundamental questions still lack clear answers: 1) what is the role of each optimization choice? 2) which ones are the bottlenecks?
This paper aims to shed light on them, facing the challenge that several confounding factors are entangled in the fine-tuning process.
arXiv Detail & Related papers (2026-01-21T02:37:44Z)
- Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens [19.316594303998667]
Reasoning tasks have long been regarded as rigorous benchmarks for assessing the capabilities of machine learning models.
We revisit reasoning tasks from a causal perspective, seeking to understand their behavior in latent space.
We introduce a framework, called SR$^2$, that incorporates the estimated latent variables as feedback into the selection mechanism.
arXiv Detail & Related papers (2025-10-09T13:45:31Z)
- What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration? [46.836858357488296]
Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry.
We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning.
arXiv Detail & Related papers (2025-10-02T06:58:29Z)
- Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search [85.201906907271]
Mini-o3 is a system that executes deep, multi-turn reasoning spanning tens of steps.
Our recipe for reproducing OpenAI o3-style behaviors comprises three key components.
Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths.
arXiv Detail & Related papers (2025-09-09T17:54:21Z)
- Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning [69.64809103333839]
We investigate how explicitly modeling a problem's difficulty prior shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning (a hedged sketch of prior-weighted sampling follows this entry).
Our approach demonstrates significant performance gains across various multimodal mathematical reasoning benchmarks with only 2K + 0.6K two-stage training data.
arXiv Detail & Related papers (2025-05-19T15:43:10Z)
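As a concrete illustration of the difficulty-prior idea in the entry above, here is a minimal sketch that samples training problems according to an assumed difficulty prior. The weighting function, the temperature-like width, and all names are hypothetical, not the paper's method.

```python
import numpy as np

# Hypothetical sketch: sample a training batch using a difficulty prior.
# The paper models difficulty explicitly; the exact weighting below
# (preferring mid-difficulty problems) is an assumption for illustration.

rng = np.random.default_rng(0)
difficulty = rng.uniform(0.0, 1.0, size=1000)   # assumed prior in [0, 1]

# Up-weight problems of intermediate difficulty: too easy gives little
# learning signal, too hard yields mostly zero outcome rewards.
weights = np.exp(-((difficulty - 0.5) ** 2) / (2 * 0.15 ** 2))
probs = weights / weights.sum()

batch = rng.choice(len(difficulty), size=32, replace=False, p=probs)
print("mean difficulty of sampled batch:", difficulty[batch].mean())
```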
- Cross-Modal Contrastive Learning for Robust Reasoning in VQA [76.1596796687494]
Multi-modal reasoning in visual question answering (VQA) has witnessed rapid progress recently.
Most reasoning models heavily rely on shortcuts learned from training data.
We propose a simple but effective cross-modal contrastive learning strategy to eliminate shortcut reasoning (a hedged sketch of the general technique follows this entry).
arXiv Detail & Related papers (2022-11-21T05:32:24Z)
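To ground the term, here is a minimal sketch of a symmetric cross-modal contrastive (InfoNCE-style) loss between paired image and text embeddings. It illustrates the general technique, not this paper's specific strategy; all names are illustrative.

```python
import numpy as np

# Generic InfoNCE-style cross-modal contrastive loss between paired
# image and text embeddings. Illustrates the general technique only;
# the paper's specific strategy may differ.

def info_nce(img: np.ndarray, txt: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric contrastive loss; row i of img pairs with row i of txt."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                  # cosine similarities / tau
    labels = np.arange(len(img))
    def xent(l):                                # cross-entropy, matched pairs
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return (xent(logits) + xent(logits.T)) / 2  # image->text and text->image

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print("loss on random pairs:", info_nce(img, txt))
```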
- REX: Reasoning-aware and Grounded Explanation [30.392986232906107]
First, we develop a new type of multi-modal explanation that justifies decisions by traversing the reasoning process and grounding keywords in the images.
Second, we identify the critical need to tightly couple important components across the visual and textual modalities for explaining the decisions.
Third, we propose a novel explanation generation method that explicitly models the pairwise correspondence between words and regions of interest.
arXiv Detail & Related papers (2022-03-11T17:28:42Z)
- Which Samples Should be Learned First: Easy or Hard? [5.589137389571604]
A weighting scheme for training samples is essential for learning tasks.
Some schemes take the easy-first mode on samples, whereas some others take the hard-first mode.
Factors including prior knowledge and data characteristics determine which samples should be learned first in a learning task (a hedged sketch of the two modes follows this entry).
arXiv Detail & Related papers (2021-10-11T03:40:29Z)
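As a concrete illustration of the easy-first versus hard-first modes described above, here is a minimal sketch that converts per-sample losses into sampling weights. The softmax form and temperature are assumptions for illustration, not the paper's scheme.

```python
import numpy as np

# Illustrative sketch of the two weighting modes: easy-first up-weights
# low-loss samples, hard-first up-weights high-loss ones. The softmax
# form and temperature are assumptions, not the paper's scheme.

def sample_weights(losses: np.ndarray, mode: str, temperature: float = 1.0) -> np.ndarray:
    """Turn per-sample losses into normalized sampling weights."""
    if mode == "easy_first":
        scores = -losses / temperature   # low loss (easy) -> high weight
    elif mode == "hard_first":
        scores = losses / temperature    # high loss (hard) -> high weight
    else:
        raise ValueError(f"unknown mode: {mode}")
    w = np.exp(scores - scores.max())    # numerically stable softmax
    return w / w.sum()

losses = np.array([0.1, 0.5, 2.0, 3.5])
print(sample_weights(losses, "easy_first"))  # emphasizes the 0.1-loss sample
print(sample_weights(losses, "hard_first"))  # emphasizes the 3.5-loss sample
```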