Prompt-Tuning Decision Transformer with Preference Ranking
- URL: http://arxiv.org/abs/2305.09648v1
- Date: Tue, 16 May 2023 17:49:04 GMT
- Title: Prompt-Tuning Decision Transformer with Preference Ranking
- Authors: Shengchao Hu, Li Shen, Ya Zhang, Dacheng Tao
- Abstract summary: We propose the Prompt-Tuning DT algorithm to address challenges by using trajectory segments as prompts to guide RL agents in acquiring environmental information.
Our approach involves randomly sampling from a Gaussian distribution to fine-tune the elements of the prompt trajectory and using a preference ranking function to find the optimization direction.
Our work contributes to the advancement of prompt-tuning approaches in RL, providing a promising direction for optimizing large RL agents for specific preference tasks.
- Score: 83.76329715043205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt-tuning has emerged as a promising method for adapting pre-trained
models to downstream tasks or aligning with human preferences. Prompt learning
is widely used in NLP but has limited applicability to RL due to the complex
physical meaning and environment-specific information contained within RL
prompts. These factors require supervised learning to imitate the
demonstrations and may result in a loss of meaning after learning.
Additionally, directly extending prompt-tuning approaches to RL is challenging
because RL prompts guide agent behavior based on environmental modeling and
analysis, rather than filling in missing information, making it unlikely that
adjustments to the prompt format for downstream tasks, as in NLP, can yield
significant improvements. In this work, we propose the Prompt-Tuning DT
algorithm to address these challenges by using trajectory segments as prompts
to guide RL agents in acquiring environmental information and optimizing
prompts via black-box tuning to enhance their ability to contain more relevant
information, thereby enabling agents to make better decisions. Our approach
involves randomly sampling from a Gaussian distribution to fine-tune the elements
of the prompt trajectory and using a preference ranking function to find the
optimization direction, thereby providing more informative prompts and guiding
the agent towards specific preferences in the target environment. Extensive
experiments show that with only 0.03% of the parameters learned, Prompt-Tuning
DT achieves comparable or even better performance than full-model fine-tuning
in low-data scenarios. Our work contributes to the advancement of prompt-tuning
approaches in RL, providing a promising direction for optimizing large RL
agents for specific preference tasks.
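
Below is a minimal, self-contained sketch of the black-box prompt-tuning loop the abstract describes: perturb the prompt trajectory with Gaussian noise, rank the perturbed candidates with a preference (scoring) function, and step in the preferred direction. The rank-weighted update, the toy `preference_score`, and all shapes and hyper-parameters are illustrative assumptions for exposition, not the authors' implementation; in the paper's setting the score would come from rolling out the frozen Decision Transformer conditioned on the candidate prompt.

```python
import numpy as np

# Illustrative sketch of black-box prompt tuning with Gaussian perturbations
# and a preference-ranking step (assumed helpers, not the paper's code).

def preference_score(prompt: np.ndarray, target: np.ndarray) -> float:
    """Toy stand-in for the preference signal.

    In the paper's setting this would be obtained by rolling out the frozen
    pre-trained agent conditioned on `prompt`; here we simply reward
    closeness to a fixed `target` so the sketch runs end to end.
    """
    return -float(np.linalg.norm(prompt - target))

def prompt_tune(prompt, target, iters=200, pop=16, sigma=0.05, lr=0.1, rng=None):
    """Tune only the prompt elements; the pre-trained model stays frozen."""
    rng = np.random.default_rng(rng)
    prompt = prompt.copy()
    for _ in range(iters):
        # 1. Randomly sample Gaussian perturbations of the prompt trajectory.
        noise = rng.standard_normal((pop, *prompt.shape)) * sigma
        candidates = prompt[None] + noise

        # 2. Score and rank the candidates with the preference function.
        scores = np.array([preference_score(c, target) for c in candidates])
        ranks = scores.argsort().argsort().astype(float)  # 0 = worst

        # 3. Turn ranks into zero-mean weights and estimate an update
        #    direction (a simple ranking-weighted, evolution-strategies-style step).
        weights = (ranks - ranks.mean()) / (ranks.std() + 1e-8)
        weights = weights.reshape(-1, *([1] * prompt.ndim))
        prompt += lr * (weights * noise).mean(axis=0) / sigma
    return prompt

# Example: a 5-step prompt of (state, action, return-to-go) triples.
init = np.zeros((5, 3))
goal = np.ones((5, 3))
tuned = prompt_tune(init, goal, rng=0)
print(np.round(tuned, 2))  # drifts toward the all-ones target
```

Note that only the prompt is updated, which is consistent with the abstract's point that roughly 0.03% of the parameters are learned while the pre-trained model remains untouched.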
Related papers
- Prompt Tuning with Diffusion for Few-Shot Pre-trained Policy Generalization [55.14484317645865]
We develop a conditional diffusion model to produce exceptional quality prompts for offline reinforcement learning tasks.
We show that the Prompt diffuser is a robust and effective tool for the prompt-tuning process, demonstrating strong performance in the meta-RL tasks.
arXiv Detail & Related papers (2024-11-02T07:38:02Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - LMGT: Optimizing Exploration-Exploitation Balance in Reinforcement Learning through Language Model Guided Trade-offs [27.014415210732103]
We introduce Language Model Guided Trade-offs (LMGT), a novel, sample-efficient framework for Reinforcement Learning.
arXiv Detail & Related papers (2024-09-07T07:40:43Z) - Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z) - Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL [62.824464372594576]
We aim to enhance arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization.
We identify a previously overlooked objective of query dependency in such optimization.
We introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data.
arXiv Detail & Related papers (2023-09-13T01:12:52Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Automatic tuning of hyper-parameters of reinforcement learning algorithms using Bayesian optimization with behavioral cloning [0.0]
In reinforcement learning (RL), the information content of data gathered by the learning agent is dependent on the setting of many hyper-parameters.
In this work, a novel approach for autonomous hyper-parameter setting using Bayesian optimization is proposed.
Experiments reveal promising results compared to other manual tweaking and optimization-based approaches.
arXiv Detail & Related papers (2021-12-15T13:10:44Z)