Related papers: Prior Prompt Engineering for Reinforcement Fine-Tuning

Prior Prompt Engineering for Reinforcement Fine-Tuning

URL: http://arxiv.org/abs/2505.14157v1
Date: Tue, 20 May 2025 10:05:11 GMT
Title: Prior Prompt Engineering for Reinforcement Fine-Tuning
Authors: Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul,
Abstract summary: We investigate prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT)<n>Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies--reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization--into corresponding pPE approaches.<n>Our results show that all pPE-trained models surpass their iPE-prompted counterparts.
Score: 16.695988860068315
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt--the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning--remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies--reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization--into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.

Related papers

Enhancing Training Data Attribution with Representational Optimization [57.61977909113113]
Training data attribution methods aim to measure how training data impacts a model's predictions.<n>We propose AirRep, a representation-based approach that closes this gap by learning task-specific and model-aligned representations explicitly for TDA.<n>AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence.
arXiv Detail & Related papers (2025-05-24T05:17:53Z)
Of All StrIPEs: Investigating Structure-informed Positional Encoding for Efficient Music Generation [1.3108652488669736]
We present a unified framework based on kernel methods to analyze both families of efficient PEs.<n>We develop a novel PE method called RoPE, capable of extracting causal relationships from temporal sequences.<n>For empirical validation, we use a symbolic music generation task, namely, melody harmonization.
arXiv Detail & Related papers (2025-04-07T11:51:29Z)
SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin [16.346540681903804]
We propose textbfSelf-training framework integrating textbfProcess textbfPreference learning using textbfDynamic value margin (SPPD)<n>Experiments on 7B-scale models demonstrate superior performance across in-domain and out-domain mathematical benchmarks.
arXiv Detail & Related papers (2025-02-19T08:11:26Z)
Reward Prediction Error Prioritisation in Experience Replay: The RPE-PER Method [1.600323605807673]
We introduce Reward Predictive Error Prioritised Experience Replay (RPE-PER)<n>RPE-PER prioritises experiences in the buffer based on RPEs.<n>Our method employs a critic network, EMCN, that predicts rewards in addition to the Q-values produced by standard critic networks.
arXiv Detail & Related papers (2025-01-30T02:09:35Z)
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models [42.124670377223175]
We propose a novel framework for inference acceleration called the Pruning All-Rounder (PAR)<n>With a self-supervised learning manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios.
arXiv Detail & Related papers (2024-12-09T13:02:35Z)
Denoising Pre-Training and Customized Prompt Learning for Efficient Multi-Behavior Sequential Recommendation [69.60321475454843]
We propose DPCPL, the first pre-training and prompt-tuning paradigm tailored for Multi-Behavior Sequential Recommendation. In the pre-training stage, we propose a novel Efficient Behavior Miner (EBM) to filter out the noise at multiple time scales. Subsequently, we propose to tune the pre-trained model in a highly efficient manner with the proposed Customized Prompt Learning (CPL) module.
arXiv Detail & Related papers (2024-08-21T06:48:38Z)
HiDe-PET: Continual Learning via Hierarchical Decomposition of Parameter-Efficient Tuning [55.88910947643436]
We propose a unified framework for continual learning (CL) with pre-trained models (PTMs) and parameter-efficient tuning (PET)<n>We present Hierarchical Decomposition PET (HiDe-PET), an innovative approach that explicitly optimize the objective through incorporating task-specific and task-shared knowledge.<n>Our approach demonstrates remarkably superior performance over a broad spectrum of recent strong baselines.
arXiv Detail & Related papers (2024-07-07T01:50:25Z)
Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z)
Higher-Order Generalization Bounds: Learning Deep Probabilistic Programs via PAC-Bayes Objectives [0.0]
We offer a framework for representing PAC-Bayes generalization bounds as programs using DPP-based methods. In particular, we show that DPP techniques may be leveraged to derive generalization bounds that draw on the compositionality of DPP representations. In turn, the bounds we introduce offer principled training objectives for higher-order probabilistic programs.
arXiv Detail & Related papers (2022-03-30T01:14:56Z)
Data Augmentation through Expert-guided Symmetry Detection to Improve Performance in Offline Reinforcement Learning [0.0]
offline estimation of the dynamical model of a Markov Decision Process (MDP) is a non-trivial task. Recent works showed that an expert-guided pipeline relying on Density Estimation methods effectively detects this structure in deterministic environments. We show that the former results lead to a performance improvement when solving the learned MDP and then applying the optimized policy in the real environment.
arXiv Detail & Related papers (2021-12-18T14:32:32Z)
Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO [90.90009491366273]
We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms. Specifically, we investigate the consequences of "code-level optimizations:" Our results show that they (a) are responsible for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.
arXiv Detail & Related papers (2020-05-25T16:24:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.