Reasoning with Reinforced Functional Token Tuning
- URL: http://arxiv.org/abs/2502.13389v1
- Date: Wed, 19 Feb 2025 02:59:42 GMT
- Title: Reasoning with Reinforced Functional Token Tuning
- Authors: Kongcheng Zhang, Qi Yao, Baisheng Lai, Jiaxing Huang, Wenkai Fang, Dacheng Tao, Mingli Song, Shunyu Liu,
- Abstract summary: We propose Reinforced Functional Token Tuning (RFTT) to empower Large Language Models (LLMs) with self-play learn-to-reason capabilities.
RFTT embeds a rich set of learnable functional tokens directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors.
- Score: 70.96651128307985
- License:
- Abstract: In this work, we propose Reinforced Functional Token Tuning (RFTT), a novel reinforced fine-tuning framework that empowers Large Language Models (LLMs) with self-play learn-to-reason capabilities. Unlike prior prompt-driven reasoning efforts, RFTT embeds a rich set of learnable functional tokens (e.g., <analyze>, <verify>, <refine>) directly into the model vocabulary, enabling chain-of-thought construction with diverse human-like reasoning behaviors. Specifically, RFTT comprises two phases: (1) supervised fine-tuning performs prompt-driven tree search to obtain self-generated training data annotated with functional tokens, which warms up the model to learn these tokens for reasoning; and (2) online reinforcement learning further allows the model to explore different reasoning pathways through functional token sampling without relying on prompts, thereby facilitating effective self-improvement for functional reasoning. Extensive experiments demonstrate the superiority of the proposed RFTT on mathematical benchmarks, significantly boosting Qwen-2.5-7B-Instruct (70.6% to 79.8%) and LLaMA-3.1-8B-Instruct (32.2% to 60.2%) on the MATH dataset. Moreover, the performance of RFTT consistently improves with more search rollouts at inference time. Our code is available at https://github.com/sastpg/RFTT.
Related papers
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions.
Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z) - The Surprising Effectiveness of Test-Time Training for Abstract Reasoning [64.36534512742736]
We investigate the effectiveness of test-time training (TTT) as a mechanism for improving models' reasoning capabilities.
TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models.
Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models.
arXiv Detail & Related papers (2024-11-11T18:59:45Z) - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - Unlock the Correlation between Supervised Fine-Tuning and Reinforcement Learning in Training Code Large Language Models [12.656574142412484]
We make an attempt to understand the correlation between supervised fine-tuning and reinforcement learning.
We find that both atomic and synthetic functions are indispensable for SFT's generalization.
arXiv Detail & Related papers (2024-06-14T03:39:01Z) - Phased Instruction Fine-Tuning for Large Language Models [12.037895935630882]
Phased Instruction Fine-Tuning (Phased IFT) is proposed, based on the idea that learning to follow instructions is a gradual process.
It assesses instruction difficulty using GPT-4, divides the instruction data into subsets of increasing difficulty, and uptrains the model sequentially on these subsets.
Experiments with Llama-2 7B/13B/70B, Llama3 8/70B and Mistral-7B models using Alpaca data show that Phased IFT significantly outperforms One-off IFT.
arXiv Detail & Related papers (2024-06-01T04:25:26Z) - Latent Template Induction with Gumbel-CRFs [107.17408593510372]
We explore the use of structured variational autoencoders to infer latent templates for sentence generation.
As a structured inference network, we show that it learns interpretable templates during training.
arXiv Detail & Related papers (2020-11-29T01:00:57Z) - DAIS: Automatic Channel Pruning via Differentiable Annealing Indicator
Search [55.164053971213576]
convolutional neural network has achieved great success in fulfilling computer vision tasks despite large computation overhead.
Structured (channel) pruning is usually applied to reduce the model redundancy while preserving the network structure.
Existing structured pruning methods require hand-crafted rules which may lead to tremendous pruning space.
arXiv Detail & Related papers (2020-11-04T07:43:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.