Rethinking Reinforcement fine-tuning of LLMs: A Multi-armed Bandit Learning Perspective
- URL: http://arxiv.org/abs/2601.14599v1
- Date: Wed, 21 Jan 2026 02:37:44 GMT
- Title: Rethinking Reinforcement fine-tuning of LLMs: A Multi-armed Bandit Learning Perspective
- Authors: Xiao Hu, Hong Xie, Tao Tan, Defu Lian, Jianyu Han
- Abstract summary: Two fundamental questions still lack a clear understanding: 1) what is the role of each optimizing choice? 2) which ones are the bottlenecks? This paper aims to shed light on these questions, facing the challenge that several confounding factors are entangled in the fine-tuning process.
- Score: 54.209612511049734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large number of heuristics have been proposed to optimize the reinforcement fine-tuning of LLMs. However, inconsistent claims are made from time to time, making this area elusive. Reflecting on this situation, two fundamental questions still lack a clear understanding: 1) what is the role of each optimizing choice? 2) which ones are the bottlenecks? This paper aims to shed light on them, facing the challenge that several confounding factors are entangled in the fine-tuning process. To tackle this challenge, we propose a bottom-up experiment pipeline. The bottom layer is a minimalist configuration: a single training example, one rollout per round, and the reward serving directly as the learning signal, with no advantage-function design. This minimalist configuration corresponds to multi-armed bandit learning with an extremely large discrete action space, which offers theory to corroborate the experimental findings. The pipeline then expands the minimalist configuration layer by layer, examining the role of each design choice. Experimental results on three LLMs and two reasoning datasets not only reveal a new understanding of the design choices but also yield essential insights that shape the area.
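To make the minimalist bottom layer concrete, the toy sketch below casts it as a multi-armed bandit: each candidate response to the single training prompt is an arm, one arm is rolled out per round, and the raw reward drives a REINFORCE-style update with no baseline or advantage term. The arm count, the stand-in verifier reward, and the learning rate are illustrative assumptions, not the authors' actual configuration.

```python
# Toy bandit view of the paper's minimalist configuration (a sketch under
# assumed values, not the authors' pipeline): one prompt, one rollout per
# round, raw reward as the learning signal, no advantage function.
import numpy as np

rng = np.random.default_rng(0)

K = 8                                   # arms stand in for a (tiny) response space
logits = np.zeros(K)                    # policy parameters for the single prompt
arm_reward = rng.uniform(0.0, 1.0, K)   # stand-in for a verifier's reward

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr = 0.5
for _ in range(500):
    probs = softmax(logits)
    a = rng.choice(K, p=probs)          # one rollout this round
    r = arm_reward[a]                   # the reward itself is the learning signal
    grad_log_pi = -probs                # d log pi(a) / d logits = e_a - probs
    grad_log_pi[a] += 1.0
    logits += lr * r * grad_log_pi      # REINFORCE update without a baseline

print("highest-reward arm:", int(arm_reward.argmax()))
print("policy's preferred arm:", int(softmax(logits).argmax()))
```

In this reduced form, the entangled design choices of full reinforcement fine-tuning (batching, multiple rollouts per prompt, advantage estimation, regularization) are stripped away, which is presumably what lets bandit theory speak directly to the experimental findings; the paper's upper layers reintroduce such choices one at a time.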
Related papers
- Demystifying Design Choices of Reinforcement Fine-tuning: A Batched Contextual Bandit Learning Perspective [83.75710105509076]
This paper aims to shed light on the role of design choices in learning and generalization dynamics. The underlying challenge is that design choices are entangled together, making their contributions to learning and generalization difficult to attribute. Experiments on three base models and two datasets reveal a new understanding of the role of various design choices in learning and generalization dynamics.
arXiv Detail & Related papers (2026-01-30T04:09:06Z)
- Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images [53.373427633330515]
We propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern.
arXiv Detail & Related papers (2025-12-19T07:44:43Z)
- Reason-to-Recommend: Using Interaction-of-Thought Reasoning to Enhance LLM Recommendation [9.282278040339138]
$\textbf{R2Rec}$ is a reasoning-enhanced recommendation framework. It samples interaction chains from the user-item graph and converts them into structured interaction-of-thoughts.
arXiv Detail & Related papers (2025-06-05T14:16:44Z)
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [67.87579664988199]
TON is a two-stage training strategy for vision-language models (VLMs). It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z)
- Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs [25.66881879984798]
Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. Recent sparsity allocation methods are often based on iteration or search, which can easily lead to suboptimal performance. We propose Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that prunes the most redundant layers (see the sketch after this list).
arXiv Detail & Related papers (2025-03-24T06:17:30Z)
- Unveiling the Magic of Code Reasoning through Hypothesis Decomposition and Amendment [54.62926010621013]
We introduce a novel task, code reasoning, to provide a new perspective on the reasoning abilities of large language models. We summarize three meta-benchmarks based on established forms of logical reasoning, and instantiate these into eight specific benchmark tasks. We present a new pathway exploration pipeline inspired by human intricate problem-solving methods.
arXiv Detail & Related papers (2025-02-17T10:39:58Z)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
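Returning to Maximum Redundancy Pruning above: the sketch below illustrates the general idea of redundancy-driven layerwise sparsity allocation. The redundancy score used here (cosine similarity between a layer's input and output activations) and the greedy allocation schedule are assumptions made for illustration; the paper's actual criterion and procedure may differ.

```python
# Illustrative sketch of redundancy-driven layerwise pruning (assumed scoring
# rule and schedule; not necessarily MRP's exact method): iteratively give
# more sparsity to whichever layer currently looks most redundant.
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 6, 64
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
calib = rng.normal(size=(32, d))        # stand-in calibration activations

def redundancy(w, h):
    """Mean cosine similarity between a layer's input and output rows:
    high similarity = the layer changes little = more redundant."""
    out = h @ w
    num = (h * out).sum(axis=1)
    den = np.linalg.norm(h, axis=1) * np.linalg.norm(out, axis=1) + 1e-12
    return float(np.mean(num / den))

sparsity = np.zeros(n_layers)           # per-layer sparsity allocation
for _ in range(12):                     # iterative allocation rounds
    h, scores = calib, []
    for w in weights:                   # score every layer on current weights
        scores.append(redundancy(w, h))
        h = h @ w
    k = int(np.argmax(scores))          # most redundant layer this round
    sparsity[k] = min(sparsity[k] + 0.1, 0.9)
    thresh = np.quantile(np.abs(weights[k]), sparsity[k])  # magnitude pruning
    weights[k] = np.where(np.abs(weights[k]) < thresh, 0.0, weights[k])

print("allocated per-layer sparsity:", np.round(sparsity, 1))
```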