Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
- URL: http://arxiv.org/abs/2507.04632v3
- Date: Mon, 28 Jul 2025 11:30:13 GMT
- Title: Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
- Authors: Yun Qu, Qi Wang, Yixiu Mao, Vincent Tao Hu, Björn Ommer, Xiangyang Ji,
- Abstract summary: This work investigates iterative approximate evaluation for arbitrary prompts.<n>It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework.<n>MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
- Score: 62.579951798437115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances have witnessed the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce iteration steps by prioritizing informative prompts during training, while the pipeline's reliance on exhaustive prompt evaluation and subset selection for optimization still incurs substantial computational overhead due to frequent LLM inference calls. Distinguished from these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that online estimates prompt difficulty without requiring costly LLM interactions. Technically, MoPPS models each prompt's success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit machine, enabling sample efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.
Related papers
- Online Multi-LLM Selection via Contextual Bandits under Unstructured Context Evolution [31.385024956599676]
Large language models (LLMs) exhibit diverse response behaviors, costs, and strengths.<n>We develop a LinUCB-based algorithm that provably achieves sublinear regret without relying on future context prediction.<n>Our algorithms are theoretically grounded and require no offline fine-tuning or dataset-specific training.
arXiv Detail & Related papers (2025-06-21T10:01:46Z) - Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining [66.54211199959298]
We propose a novel preference learning framework, Modality-Balancing Preference Optimization (MBPO), to address the modality imbalance in LMMs.<n>MBPO constructs a more effective offline preference dataset by generating hard negatives, i.e., rejected responses misled by LLM biases.<n>It can enhance LMM performance on challenging vision-language tasks and effectively reduce hallucinations.
arXiv Detail & Related papers (2025-05-20T03:59:05Z) - Prompt Tuning with Diffusion for Few-Shot Pre-trained Policy Generalization [55.14484317645865]
We develop a conditional diffusion model to produce exceptional quality prompts for offline reinforcement learning tasks.
We show that the Prompt diffuser is a robust and effective tool for the prompt-tuning process, demonstrating strong performance in the meta-RL tasks.
arXiv Detail & Related papers (2024-11-02T07:38:02Z) - In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality.<n>To handle these challenges, a direct solution is to generate high-confidence'' data from unsupervised downstream tasks.<n>We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z) - QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries.<n>We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.<n> Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z) - Query-Dependent Prompt Evaluation and Optimization with Offline Inverse
RL [62.824464372594576]
We aim to enhance arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization.
We identify a previously overlooked objective of query dependency in such optimization.
We introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data.
arXiv Detail & Related papers (2023-09-13T01:12:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.