Reasoning with Reinforced Functional Token Tuning
- URL: http://arxiv.org/abs/2502.13389v1
- Date: Wed, 19 Feb 2025 02:59:42 GMT
- Title: Reasoning with Reinforced Functional Token Tuning
- Authors: Kongcheng Zhang, Qi Yao, Baisheng Lai, Jiaxing Huang, Wenkai Fang, Dacheng Tao, Mingli Song, Shunyu Liu,
- Abstract summary: We propose Reinforced Functional Token Tuning (RFTT) to empower Large Language Models (LLMs) with self-play learn-to-reason capabilities.<n>RFTT embeds a rich set of learnable functional tokens directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors.
- Score: 70.96651128307985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we propose Reinforced Functional Token Tuning (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with self-play learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens (e.g., <analyze>, <verify>, <refine>) directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for reasoning; and (2) online reinforcement learning further allows the model to explore different reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks, significantly boosting Qwen-2.5-7B-Instruct (70.6% to 79.8%) and LLaMA-3.1-8B-Instruct (32.2% to 60.2%) on the MATH dataset. Moreover, the performance of RFTT consistently improves with more search rollouts at inference time. Our code is available at https://github.com/sastpg/RFTT.
Related papers
- Exploring Superior Function Calls via Reinforcement Learning [9.278264697070306]
We present a novel reinforcement learning framework designed to enhance group relative policy optimization.<n>We address three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction.<n>Our framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios.
arXiv Detail & Related papers (2025-08-07T07:51:38Z) - Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following [10.119219532863767]
lazy reasoning during the thinking stage is the primary factor contributing to poor instruction adherence.<n>We propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking.<n>Our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.
arXiv Detail & Related papers (2025-08-05T07:42:00Z) - Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better.<n>TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks.<n>We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach.<n>We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding.<n>As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models [45.938663388013445]
We show that a small set of high-impact activations in the last few layers governs long-form reasoning attributes.<n>By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training.
arXiv Detail & Related papers (2025-05-23T10:07:18Z) - Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning [54.65050470296886]
We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate missing intermediate reasoning steps.<n>We demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets.<n>Our approach effectively enhances distilled data and provides better starting points for reinforcement learning.
arXiv Detail & Related papers (2025-05-20T17:59:31Z) - The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency.
UPFT removes the need for labeled data or exhaustive sampling.
Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z) - Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data.
We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z) - SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions.<n>Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z) - The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [64.36534512742736]
We investigate the effectiveness of test-time training (TTT) as a mechanism for improving models' reasoning capabilities.
TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models.
Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models.
arXiv Detail & Related papers (2024-11-11T18:59:45Z) - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models [12.656574142412484]
We make an attempt to understand the correlation between supervised fine-tuning and reinforcement learning.<n>We find that both atomic and synthetic functions are indispensable for SFT's generalization.
arXiv Detail & Related papers (2024-06-14T03:39:01Z) - Phased Instruction Fine-Tuning for Large Language Models [12.037895935630882]
Phased Instruction Fine-Tuning (Phased IFT) is proposed, based on the idea that learning to follow instructions is a gradual process.
It assesses instruction difficulty using GPT-4, divides the instruction data into subsets of increasing difficulty, and uptrains the model sequentially on these subsets.
Experiments with Llama-2 7B/13B/70B, Llama3 8/70B and Mistral-7B models using Alpaca data show that Phased IFT significantly outperforms One-off IFT.
arXiv Detail & Related papers (2024-06-01T04:25:26Z) - Latent Template Induction with Gumbel-CRFs [107.17408593510372]
We explore the use of structured variational autoencoders to infer latent templates for sentence generation.
As a structured inference network, we show that it learns interpretable templates during training.
arXiv Detail & Related papers (2020-11-29T01:00:57Z) - DAIS: Automatic Channel Pruning via Differentiable Annealing Indicator
Search [55.164053971213576]
convolutional neural network has achieved great success in fulfilling computer vision tasks despite large computation overhead.
Structured (channel) pruning is usually applied to reduce the model redundancy while preserving the network structure.
Existing structured pruning methods require hand-crafted rules which may lead to tremendous pruning space.
arXiv Detail & Related papers (2020-11-04T07:43:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.