Rewarded soups: towards Pareto-optimal alignment by interpolating
weights fine-tuned on diverse rewards
- URL: http://arxiv.org/abs/2306.04488v2
- Date: Mon, 16 Oct 2023 13:53:14 GMT
- Title: Rewarded soups: towards Pareto-optimal alignment by interpolating
weights fine-tuned on diverse rewards
- Authors: Alexandre Ramé, Guillaume Couairon, Mustafa Shukor, Corentin
Dancette, Jean-Baptiste Gaya, Laure Soulier and Matthieu Cord
- Abstract summary: Foundation models are first pre-trained on vast unsupervised datasets and then fine-tuned on labeled data.
We propose embracing the heterogeneity of diverse rewards by following a multi-policy strategy.
We demonstrate the effectiveness of our approach for text-to-text (summarization, Q&A, helpful assistant, review), text-image (image captioning, text-to-image generation, visual grounding, VQA), and control (locomotion) tasks.
- Score: 101.7246658985579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models are first pre-trained on vast unsupervised datasets and
then fine-tuned on labeled data. Reinforcement learning, notably from human
feedback (RLHF), can further align the network with the intended usage. Yet the
imperfections in the proxy reward may hinder the training and lead to
suboptimal results; the diversity of objectives in real-world tasks and human
opinions exacerbate the issue. This paper proposes embracing the heterogeneity
of diverse rewards by following a multi-policy strategy. Rather than focusing
on a single a priori reward, we aim for Pareto-optimal generalization across
the entire space of preferences. To this end, we propose rewarded soup, first
specializing multiple networks independently (one for each proxy reward) and
then interpolating their weights linearly. This succeeds empirically because we
show that the weights remain linearly connected when fine-tuned on diverse
rewards from a shared pre-trained initialization. We demonstrate the
effectiveness of our approach for text-to-text (summarization, Q&A, helpful
assistant, review), text-image (image captioning, text-to-image generation,
visual grounding, VQA), and control (locomotion) tasks. We hope to enhance the
alignment of deep models and improve how they interact with the world in all
its diversity.
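The core operation described in the abstract, specializing one network per proxy reward and then linearly interpolating their weights, can be sketched in a few lines. The snippet below is a minimal illustration under stated assumptions, not the authors' released code: the helper names (`interpolate_state_dicts`, `rewarded_soup`) are hypothetical, and it assumes PyTorch models with floating-point parameters fine-tuned from the same pre-trained initialization, which is the condition the paper gives for the weights remaining linearly connected.

```python
# Minimal sketch of rewarded-soup weight interpolation (hypothetical helper
# names; assumes models fine-tuned from a shared pre-trained initialization
# with floating-point parameters).
from copy import deepcopy

import torch
import torch.nn as nn


def interpolate_state_dicts(state_dicts, coeffs):
    """Linearly combine state dicts: mixed[k] = sum_i coeffs[i] * state_dicts[i][k]."""
    assert abs(sum(coeffs) - 1.0) < 1e-6, "interpolation coefficients should sum to 1"
    mixed = deepcopy(state_dicts[0])
    for key in mixed:
        mixed[key] = sum(c * sd[key] for c, sd in zip(coeffs, state_dicts))
    return mixed


def rewarded_soup(models, coeffs):
    """Build one policy whose weights linearly interpolate the reward-specialized models."""
    soup = deepcopy(models[0])
    soup.load_state_dict(interpolate_state_dicts([m.state_dict() for m in models], coeffs))
    return soup


if __name__ == "__main__":
    # Stand-ins for networks fine-tuned (e.g., via RLHF) on two different proxy
    # rewards, starting from the same pre-trained initialization.
    base = nn.Linear(16, 4)
    model_r1, model_r2 = deepcopy(base), deepcopy(base)
    # ... fine-tune model_r1 on proxy reward 1 and model_r2 on proxy reward 2 ...

    # Sweeping the interpolation coefficient at deployment time traces an
    # approximate Pareto front over the two rewards without any retraining.
    for lam in torch.linspace(0.0, 1.0, steps=5):
        policy = rewarded_soup([model_r1, model_r2], [1.0 - lam.item(), lam.item()])
        # evaluate `policy` on both proxy rewards here
```

In this sketch the user's preference is expressed only at inference time through the coefficient sweep, which is what lets a single set of fine-tuned checkpoints cover the whole space of reward trade-offs.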
Related papers
- R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback [25.27230140274847]
Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences.
This paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation.
arXiv Detail & Related papers (2024-11-13T02:45:21Z)
- Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling [58.50618448027103]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
This paper explores the differences across various CLIP-trained vision backbones.
The method improves accuracy by up to 39.1% over the best single backbone.
arXiv Detail & Related papers (2024-05-27T12:59:35Z)
- Multi-turn Reinforcement Learning from Preference Human Feedback [41.327438095745315]
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences.
Existing methods emulate preferences at the level of a single decision (turn).
We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
arXiv Detail & Related papers (2024-05-23T14:53:54Z)
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output.
We use these attention weights to redistribute the reward along the whole completion.
Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Improving Generalization of Alignment with Human Preferences through Group Invariant Learning [56.19242260613749]
Reinforcement Learning from Human Feedback (RLHF) enables the generation of responses more aligned with human preferences.
Previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples.
We propose a novel approach that can learn a consistent policy via RL across various data groups or domains.
arXiv Detail & Related papers (2023-10-18T13:54:15Z)