Few-shot In-Context Preference Learning Using Large Language Models
- URL: http://arxiv.org/abs/2410.17233v1
- Date: Tue, 22 Oct 2024 17:53:34 GMT
- Title: Few-shot In-Context Preference Learning Using Large Language Models
- Authors: Chao Yu, Hong Lu, Jiaxuan Gao, Qixin Tan, Xinting Yang, Yu Wang, Yi Wu, Eugene Vinitsky
- Abstract summary: Designing reward functions is a core component of reinforcement learning.
Learning these rewards can be exceedingly inefficient because they are often learned tabula rasa.
We propose In-Context Preference Learning (ICPL) to accelerate learning reward functions from preferences.
- Score: 15.84585737510038
- License:
- Abstract: Designing reward functions is a core component of reinforcement learning but can be challenging for truly complex behavior. Reinforcement Learning from Human Feedback (RLHF) has been used to alleviate this challenge by replacing a hand-coded reward function with a reward function learned from preferences. However, it can be exceedingly inefficient to learn these rewards as they are often learned tabula rasa. We investigate whether Large Language Models (LLMs) can reduce this query inefficiency by converting an iterative series of human preferences into code representing the rewards. We propose In-Context Preference Learning (ICPL), a method that uses the grounding of an LLM to accelerate learning reward functions from preferences. ICPL takes the environment context and task description, synthesizes a set of reward functions, and then repeatedly updates the reward functions using human rankings of videos of the resultant policies. Using synthetic preferences, we demonstrate that ICPL is orders of magnitude more efficient than RLHF and is even competitive with methods that use ground-truth reward functions instead of preferences. Finally, we perform a series of human preference-learning trials and observe that ICPL extends beyond synthetic settings and can work effectively with humans-in-the-loop. Additional information and videos are provided at https://sites.google.com/view/few-shot-icpl/home.
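The abstract above describes ICPL as an iterative loop: the LLM synthesizes candidate reward functions from the environment context and task description, policies are trained against them, and human rankings of the resulting rollout videos are fed back to the LLM to produce improved candidates. The sketch below is a minimal rendering of that loop under stated assumptions; every helper (`query_llm`, `train_policy`, `render_rollout`, `collect_human_ranking`) is a hypothetical placeholder, not the authors' implementation.

```python
# Minimal sketch of the ICPL loop as described in the abstract.
# All helpers are hypothetical placeholders, not the authors' code.

def query_llm(prompt: str, n: int) -> list[str]:
    """Ask an LLM for `n` candidate reward functions as Python source strings."""
    raise NotImplementedError  # plug in your preferred LLM API

def train_policy(reward_source: str):
    """Train an RL policy against the reward function defined in `reward_source`."""
    raise NotImplementedError

def render_rollout(policy) -> str:
    """Roll out the policy and return a path to a video of the behavior."""
    raise NotImplementedError

def collect_human_ranking(videos: list[str]) -> list[int]:
    """Return video indices ordered from most to least preferred by the human."""
    raise NotImplementedError

def icpl(env_context: str, task_description: str, n_candidates: int = 6, n_iters: int = 5) -> str:
    prompt = (f"Environment:\n{env_context}\nTask:\n{task_description}\n"
              f"Write {n_candidates} reward functions.")
    rewards = query_llm(prompt, n_candidates)
    best_reward = rewards[0]
    for _ in range(n_iters):
        policies = [train_policy(r) for r in rewards]
        videos = [render_rollout(p) for p in policies]
        ranking = collect_human_ranking(videos)       # human preference feedback
        best_reward = rewards[ranking[0]]
        feedback = (f"Best reward so far:\n{best_reward}\n"
                    f"Worst reward:\n{rewards[ranking[-1]]}\n"
                    f"Propose {n_candidates} improved reward functions.")
        rewards = query_llm(feedback, n_candidates)   # in-context update from preferences
    return best_reward
```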
Related papers
- MAPLE: A Framework for Active Preference Learning Guided by Large Language Models [9.37268652939886]
We introduce MAPLE, a framework for large language model-guided Bayesian active preference learning.
Our results demonstrate that MAPLE accelerates the learning process and effectively improves humans' ability to answer queries.
arXiv Detail & Related papers (2024-12-10T05:55:14Z)
- Real-Time Personalization for LLM-based Recommendation with Customized In-Context Learning [57.28766250993726]
This work explores adapting to dynamic user interests without any model updates.
Existing Large Language Model (LLM)-based recommenders often lose the in-context learning ability during recommendation tuning.
We propose RecICL, which customizes recommendation-specific in-context learning for real-time recommendations.
arXiv Detail & Related papers (2024-10-30T15:48:36Z)
- Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
- Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
- Enhancing Q-Learning with Large Language Model Heuristics [0.0]
Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations.
We propose LLM-guided Q-learning, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning.
arXiv Detail & Related papers (2024-05-06T10:42:28Z)
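The entry above frames LLM output as a heuristic for learning the Q-function. One simple way to realize that idea, shown here only as an illustrative sketch rather than the paper's exact mechanism, is to initialize a Q-table from an LLM-provided estimate and let standard TD updates correct any hallucinated values. `llm_heuristic` is a hypothetical stub, and `env` is assumed to expose `reset() -> state` and `step(action) -> (next_state, reward, done)`.

```python
import random

def llm_heuristic(state: int, action: int) -> float:
    """Hypothetical stub: in practice, query an LLM for a rough value estimate."""
    return 0.0

def q_learning_with_llm_heuristic(env, n_states, n_actions,
                                  episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    # Initialize Q from the LLM heuristic instead of zeros; TD updates
    # gradually overwrite inaccurate (hallucinated) heuristic values.
    Q = [[llm_heuristic(s, a) for a in range(n_actions)] for s in range(n_states)]
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:
                a = random.randrange(n_actions)           # explore
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])  # exploit
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```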
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
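LINVIT is summarized above as incorporating LLM guidance as a regularization factor in value-based RL. A minimal, hedged illustration of that general idea (not the paper's algorithm) is KL-regularized value iteration on a known tabular MDP, where the policy is kept close to an LLM-suggested policy `pi_llm` while still maximizing the learned values.

```python
import numpy as np

def kl_regularized_value_iteration(P, R, pi_llm, gamma=0.99, lam=1.0, iters=200):
    """
    Illustrative sketch: value iteration where the policy is KL-regularized
    toward an LLM-suggested policy pi_llm (shape [S, A]).
    P: transition tensor [S, A, S]; R: reward matrix [S, A].
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V                      # [S, A]
        # Soft backup: V(s) = lam * log sum_a pi_llm(a|s) * exp(Q(s,a) / lam)
        V = lam * np.log(np.sum(pi_llm * np.exp(Q / lam), axis=1))
    # Regularized policy: pi(a|s) proportional to pi_llm(a|s) * exp(Q(s,a) / lam)
    Q = R + gamma * P @ V
    logits = np.log(pi_llm + 1e-12) + Q / lam
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    return V, pi / pi.sum(axis=1, keepdims=True)
```

Smaller `lam` approaches ordinary greedy value iteration; larger `lam` keeps the policy closer to the LLM suggestion.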
- Active Preference Learning for Large Language Models [12.093302163058436]
We develop an active learning strategy for Direct Preference Optimization (DPO) to make better use of preference labels.
We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model.
We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
arXiv Detail & Related papers (2024-02-12T23:09:00Z)
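The acquisition function described above ranks prompt/completion pairs by the language model's predictive entropy. Below is a rough sketch of such a score using Hugging Face `transformers`, meant only to illustrate the general idea rather than the authors' exact formulation; it assumes that tokenizing the concatenated prompt and completion splits cleanly at the boundary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_predictive_entropy(model, tokenizer, prompt: str, completion: str) -> float:
    """Average per-token entropy of the model's next-token distribution over the completion."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                    # [1, T, vocab]
    start = prompt_ids.shape[1] - 1                        # positions predicting completion tokens
    step_logits = logits[0, start:-1, :]
    log_probs = torch.log_softmax(step_logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # per-token entropy
    return entropy.mean().item()

# Usage sketch: rank an unlabeled pool and query humans on the most uncertain pairs.
# model = AutoModelForCausalLM.from_pretrained("gpt2"); tokenizer = AutoTokenizer.from_pretrained("gpt2")
# scores = [mean_predictive_entropy(model, tokenizer, p, c) for p, c in pool]
# to_label = sorted(range(len(pool)), key=lambda i: -scores[i])[:batch_size]
```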
- Enabling Language Models to Implicitly Learn Self-Improvement [49.16868302881804]
Large Language Models (LLMs) have demonstrated remarkable capabilities in open-ended text generation tasks.
We propose an ImPlicit Self-ImprovemenT (PIT) framework that implicitly learns the improvement goal from human preference data.
arXiv Detail & Related papers (2023-10-02T04:29:40Z)
- Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL [62.824464372594576]
We aim to enhance the arithmetic reasoning ability of Large Language Models (LLMs) through zero-shot prompt optimization.
We identify a previously overlooked objective of query dependency in such optimization.
We introduce Prompt-OIRL, which harnesses offline inverse reinforcement learning to draw insights from offline prompting demonstration data.
arXiv Detail & Related papers (2023-09-13T01:12:52Z)
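Prompt-OIRL, as summarized above, learns from offline prompting demonstrations to evaluate prompts in a query-dependent way. The sketch below captures only the general pattern: fit an offline reward model that predicts whether a prompt succeeds on a given query, then pick the highest-scoring prompt per query without extra LLM calls. The feature construction and use of scikit-learn are assumptions for illustration, not the paper's recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_offline_reward_model(query_feats, prompt_ids, successes, n_prompts):
    """Fit r(query, prompt) ~ P(success) from an offline log of prompting outcomes."""
    prompt_onehot = np.eye(n_prompts)[prompt_ids]             # [N, n_prompts]
    X = np.concatenate([query_feats, prompt_onehot], axis=1)  # query-dependent features
    model = LogisticRegression(max_iter=1000)
    model.fit(X, successes)
    return model

def select_prompt(model, query_feat, n_prompts):
    """At inference time, score every candidate prompt for this query and take the best."""
    candidates = np.concatenate(
        [np.tile(query_feat, (n_prompts, 1)), np.eye(n_prompts)], axis=1)
    return int(np.argmax(model.predict_proba(candidates)[:, 1]))
```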
- OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
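OverPrompt's cost saving, per the summary above, comes from packing several task inputs into a single in-context request. A hedged sketch of that batching pattern follows; `call_llm` is a hypothetical stub standing in for whatever chat API is used, and the JSON output format is an assumption.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stub for a chat-completion API call."""
    raise NotImplementedError

def batched_zero_shot_classify(texts, labels, batch_size=10):
    """Classify many inputs with one LLM call per batch instead of one call per input."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        numbered = "\n".join(f"{j + 1}. {t}" for j, t in enumerate(batch))
        prompt = (
            f"Classify each of the following texts as one of {labels}.\n"
            f"{numbered}\n"
            'Answer with a JSON list of labels, one per text, e.g. ["positive", "negative"].'
        )
        results.extend(json.loads(call_llm(prompt)))
    return results
```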
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.