OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
- URL: http://arxiv.org/abs/2404.12195v1
- Date: Thu, 18 Apr 2024 13:57:18 GMT
- Title: OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
- Authors: Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake
- Abstract summary: In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models.
We first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model.
We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Instruction fine-tuning pretrained LLMs for diverse downstream tasks has demonstrated remarkable success and has captured the interest of both academics and practitioners. To ensure such fine-tuned LLMs align with human preferences, techniques such as RLHF and DPO have emerged. At the same time, there is increasing interest in smaller parameter counts for models. In this work, using OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the OpenBezoar family of models. In this recipe: We first generate synthetic instruction fine-tuning data using an open and commercially non-restrictive instruction fine-tuned variant of the Falcon-40B model under three schemes based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a seed dataset) and Orca (with the Flan Collection as a seed dataset), then filter these generations using GPT-4 as a human proxy. We then perform cost-effective QLoRA-based supervised fine-tuning sequentially with each scheme. The resulting checkpoint is further fine-tuned with a subset of the HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to obtain the final checkpoint. Evaluation is done with the LM Eval Harness tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with Claude 2.1, with the finding that the final checkpoint, "OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at the 3B parameter scale, even outperforming the top model in one of the categories on the Huggingface Open LLM Leaderboard. We release "OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", "OpenBezoar-HH-RLHF-DPO" checkpoints, alongside our generated datasets on HuggingFace at https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc and our codebase at https://bitbucket.org/paladinanalytics/workspace/projects/OP.
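To make the recipe above concrete, the following is a minimal sketch of the QLoRA setup described in the abstract (4-bit NF4 quantization with LoRA adapters on OpenLLaMA 3Bv2), together with the standard DPO loss used in the final preference-tuning stage. It assumes the `transformers`, `peft`, and `bitsandbytes` libraries; the LoRA rank, target modules, and `beta` are illustrative values, not the authors' exact configuration.

```python
# Sketch of QLoRA preparation for OpenLLaMA 3Bv2 plus the standard DPO loss.
# Assumes transformers, peft, bitsandbytes, and torch; hyperparameters below
# are illustrative, not the paper's exact settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "openlm-research/open_llama_3b_v2",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: prefer 'chosen' over 'rejected' responses,
    regularized toward the frozen reference model."""
    logits = beta * ((pi_chosen_logps - ref_chosen_logps)
                     - (pi_rejected_logps - ref_rejected_logps))
    return -torch.nn.functional.logsigmoid(logits).mean()
```

In practice the SFT and DPO stages would be driven by a standard trainer (e.g., trl's SFTTrainer and DPOTrainer) over the generated instruction datasets and the HH-RLHF subset, respectively.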
Related papers
- Streaming Looking Ahead with Token-level Self-reward [50.699168440048716]
We propose a policy model with token-level self-reward modeling (TRM) capability to eliminate the need for external models and extra communication.
In addition, we propose a streaming-looking-ahead (SLA) algorithm to further boost search efficiency with better parallelization.
When combined with reinforcement fine-tuning techniques such as DPO, SLA achieves an overall win rate of 89.4%.
arXiv Detail & Related papers (2025-02-24T22:35:53Z)
- Practical Secure Inference Algorithm for Fine-tuned Large Language Model Based on Fully Homomorphic Encryption [0.0]
We combine Fully Homomorphic Encryption (FHE) and provable security theory with Parameter-Efficient Fine-Tuning (PEFT) to propose an efficient and secure inference scheme for large language models.
In this paper, we use the open-source model ChatGLM2-6B as the base model which is fine-tuned by LoRA.
Experimental results show that the inference efficiency of our scheme reaches 1.61s/, demonstrating that the scheme is practical.
arXiv Detail & Related papers (2025-01-03T07:19:23Z)
- How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback).
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to obtain real reward model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z)
- Bootstrapping Language Models with DPO Implicit Rewards [45.68366127605774]
Direct preference optimization (DPO) has greatly simplified alignment relative to earlier reinforcement learning from human feedback pipelines, and a DPO-trained model implicitly defines a reward model relative to its reference model.
In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM.
Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance.
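The bootstrapping in DICE rests on the reward that a DPO-trained policy implicitly defines relative to its reference model. Below is a minimal sketch of that implicit reward and of how it could rank sampled responses into new preference pairs; the numeric log-probabilities are purely illustrative.

```python
import torch

def dpo_implicit_reward(policy_logp, ref_logp, beta=0.1):
    # DPO's implicit reward (up to a prompt-only constant):
    #   r(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
    return beta * (policy_logp - ref_logp)

# DICE-style bootstrapping (sketch): score several sampled responses with the
# implicit reward, keep the best/worst as a new (chosen, rejected) pair, and
# run another round of DPO on those pairs.
policy_logps = torch.tensor([-12.3, -15.8, -10.1])   # illustrative values
ref_logps = torch.tensor([-13.0, -14.2, -12.5])
rewards = dpo_implicit_reward(policy_logps, ref_logps)
chosen_idx, rejected_idx = rewards.argmax().item(), rewards.argmin().item()
```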
arXiv Detail & Related papers (2024-06-14T06:57:18Z)
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing [48.07915731998946]
We present Magpie, a self-synthesis method for generating large-scale alignment data.
We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses.
Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct.
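The snippet below sketches a Magpie-style generation loop under the paper's key observation: an aligned chat model, given only the pre-query part of its chat template, will sample a plausible user instruction, which the same model can then answer. The model ID and template string are assumptions (the Llama-3 checkpoint is gated and its template may vary across versions).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # gated repo; assumes access
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Step 1: feed only the pre-query chat template; sampling a completion yields
# a synthetic user instruction "from nothing".
pre_query = "<|start_header_id|>user<|end_header_id|>\n\n"
inputs = tok(pre_query, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=True)
instruction = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                         skip_special_tokens=True)

# Step 2: ask the same model to answer the instruction it just produced.
chat_ids = tok.apply_chat_template(
    [{"role": "user", "content": instruction}],
    add_generation_prompt=True, return_tensors="pt").to(model.device)
response = tok.decode(model.generate(chat_ids, max_new_tokens=256)[0][chat_ids.shape[1]:],
                      skip_special_tokens=True)
```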
arXiv Detail & Related papers (2024-06-12T17:52:30Z)
- RLHF Workflow: From Reward Modeling to Online RLHF [79.83927049253924]
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report.
Online RLHF is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature.
We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets.
arXiv Detail & Related papers (2024-05-13T15:50:39Z)
- Labeling supervised fine-tuning data with the scaling law [0.0]
This paper introduces a multi-stage manual annotation approach guided by the scaling law, offering a method for acquiring high-quality supervised fine-tuning data.
We have preprocessed 58k authentic chat data and manually annotated 2.3k questions.
We conducted fine-tuning on Qwen models ranging from 0.5B to 32B parameters, and the best version improved the F1 score by 29.07 points.
arXiv Detail & Related papers (2024-05-05T05:43:20Z)
- Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preference.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO: amplifying the reward signal learned during alignment training.
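That amplification is, as the ExPO paper describes it, a simple weight-space extrapolation from a weaker (e.g., SFT) checkpoint through an aligned (DPO/RLHF) checkpoint of the same architecture. A minimal sketch follows; the coefficient `alpha` is an illustrative value.

```python
import torch

def expo_extrapolate(sft_state, aligned_state, alpha=0.3):
    """Weak-to-strong extrapolation: step past the aligned checkpoint along
    the direction (aligned - sft), amplifying what alignment training learned."""
    return {k: aligned_state[k] + alpha * (aligned_state[k] - sft_state[k])
            for k in aligned_state}

# Usage sketch with two same-architecture checkpoints:
# merged = expo_extrapolate(sft_model.state_dict(), dpo_model.state_dict())
# dpo_model.load_state_dict(merged)
```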
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
- Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models [4.081098869497239]
We develop state-of-the-art privacy attacks against Large Language Models (LLMs).
Our new membership inference attacks (MIAs) against pretrained LLMs perform hundreds of times better than baseline attacks.
In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance.
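The fine-tuning attack described above can be sketched very compactly: score each candidate text by the ratio of its loss under the fine-tuned model to its loss under the base model, and flag low ratios as likely members of the fine-tuning set. The exact scoring and threshold selection used in the paper are not reproduced here.

```python
import torch

@torch.no_grad()
def loss_ratio_mia_score(text, finetuned_model, base_model, tokenizer):
    """Membership-inference score in the spirit of the loss-ratio attack:
    examples seen during fine-tuning tend to have a much lower loss under the
    fine-tuned model than under the base model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    ft_loss = finetuned_model(input_ids=ids, labels=ids).loss
    base_loss = base_model(input_ids=ids, labels=ids).loss
    return (ft_loss / base_loss).item()   # lower => more likely a member
```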
arXiv Detail & Related papers (2024-02-26T20:41:50Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
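A rough sketch of the self-play objective, under the common reading of SPIN: at each iteration the current model is trained to prefer the ground-truth SFT response over a response generated by its own previous iteration, which also serves as the reference. Variable names and `beta` are illustrative.

```python
import torch
import torch.nn.functional as F

def spin_loss(cur_logp_real, cur_logp_synth,
              prev_logp_real, prev_logp_synth, beta=0.1):
    """Self-play objective (sketch): push the current model toward the real
    (human) response and away from the response its previous iteration
    generated, measured relative to that previous iteration."""
    logits = beta * ((cur_logp_real - prev_logp_real)
                     - (cur_logp_synth - prev_logp_synth))
    return -F.logsigmoid(logits).mean()
```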
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- Zephyr: Direct Distillation of LM Alignment [59.03530095974505]
We aim to produce a smaller language model that is aligned to user intent.
Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, such models remain poorly aligned with user intent.
We apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment.
arXiv Detail & Related papers (2023-10-25T19:25:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.