Continual SFT Matches Multimodal RLHF with Negative Supervision
- URL: http://arxiv.org/abs/2411.14797v1
- Date: Fri, 22 Nov 2024 08:48:30 GMT
- Title: Continual SFT Matches Multimodal RLHF with Negative Supervision
- Authors: Ke Zhu, Yu Wang, Yanpeng Sun, Qiang Chen, Jiangjiang Liu, Gang Zhang, Jingdong Wang
- Abstract summary: Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension.
Conventional wisdom holds that it is superior to continual SFT during this preference alignment stage.
We propose a novel negative supervised finetuning (nSFT) approach that fully exploits this negative supervision.
- Score: 32.784161582943874
- License:
- Abstract: Multimodal RLHF usually happens after the supervised finetuning (SFT) stage to continually improve vision-language models' (VLMs) comprehension. Conventional wisdom holds that it is superior to continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logits of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits this information. Our nSFT disentangles this negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss. This is more memory-efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously demonstrated by comparing it with various multimodal RLHF approaches across different dataset sources, base VLMs, and evaluation metrics. In addition, a wealth of ablations is provided to support our hypothesis. We hope this paper will stimulate further research on properly aligning large vision-language models.
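The memory argument in the abstract can be made concrete with a minimal, hypothetical PyTorch sketch (not the paper's code): an SFT-style objective, like the one nSFT reduces alignment to, keeps only the single trainable policy in memory, while DPO additionally requires a frozen reference model (and PPO adds reward and value models on top). The model interface below assumes a HuggingFace-style causal LM exposing `.logits`, and the `sft_loss`/`dpo_loss` helpers are illustrative names.

```python
import torch
import torch.nn.functional as F

def sft_loss(policy, input_ids, labels):
    """Standard next-token cross-entropy: one model in memory.

    `labels` should be -100 on prompt/ignored tokens so only the response is
    supervised (e.g., on nSFT-style targets built from negative supervision).
    """
    logits = policy(input_ids).logits[:, :-1, :]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

def dpo_loss(policy, reference, chosen_ids, rejected_ids, beta=0.1):
    """DPO preference loss: needs log-probs from BOTH the trainable policy
    and a frozen reference model held in memory at the same time."""
    def seq_logprob(model, ids):
        logits = model(ids).logits[:, :-1, :]
        logps = torch.log_softmax(logits, dim=-1)
        return logps.gather(-1, ids[:, 1:, None]).squeeze(-1).sum(-1)

    chosen_ratio = seq_logprob(policy, chosen_ids) - seq_logprob(reference, chosen_ids)
    rejected_ratio = seq_logprob(policy, rejected_ids) - seq_logprob(reference, rejected_ids)
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

In practice the reference-model forward passes would run under `torch.no_grad()`; the point of the sketch is only that the SFT path drops the second large model entirely.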
Related papers
- Training Language Models to Critique With Multi-agent Feedback [102.42751835338233]
The MultiCritique pipeline improves the critique ability of LLMs by utilizing multi-agent feedback.
The pipeline aggregates high-quality critiques from multiple agents instead of a single model.
Our fine-tuned 7B model significantly surpasses other advanced 7B-13B open-source models.
arXiv Detail & Related papers (2024-10-20T04:57:45Z)
- SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe [30.03925858123481]
We propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional next-token prediction (NTP) paradigm.
Based on training dynamics, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process.
This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks.
arXiv Detail & Related papers (2024-10-07T17:52:21Z)
- Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation [62.202893186343935]
We explore what it would take to adapt Large Language Models for low-resource languages.
We show that parallel data is critical during both pre-training and Supervised Fine-Tuning (SFT).
Our experiments with three LLMs across two low-resourced language groups reveal consistent trends, underscoring the generalizability of our findings.
arXiv Detail & Related papers (2024-08-23T00:59:38Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- Teaching Large Language Models to Reason with Reinforcement Learning [38.17625148525193]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a dominant approach for aligning LLM outputs with human preferences.
Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback.
arXiv Detail & Related papers (2024-03-07T16:36:29Z)
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback [103.08766858584049]
We present RLHF-V, which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback.
Experiments on five benchmarks in both automatic and human evaluation show that RLHF-V can enable substantially more trustworthy MLLM behaviors.
arXiv Detail & Related papers (2023-12-01T11:36:08Z)
- Understanding the Effects of RLHF on LLM Generalisation and Diversity [26.56388427640671]
Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date.
We present an analysis of how each stage of the process affects two key properties: out-of-distribution (OOD) generalisation and output diversity.
arXiv Detail & Related papers (2023-10-10T09:25:44Z)
- Mitigating the Alignment Tax of RLHF [76.4300447532456]
Aligning LLMs with Reinforcement Learning from Human Feedback (RLHF) can lead to forgetting pretrained abilities, also known as the alignment tax.
We propose model averaging to maximize alignment performance while incurring minimal alignment tax.
We validate the resulting method (HMA) across a range of RLHF algorithms on OpenLLaMA-3B and further extend our findings to Mistral-7B; a minimal weight-averaging sketch follows this list.
arXiv Detail & Related papers (2023-09-12T14:16:54Z)
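As referenced in the alignment-tax entry above, plain weight-space model averaging is simple to illustrate. Below is a minimal, hypothetical PyTorch sketch that linearly interpolates an SFT checkpoint with an RLHF-aligned checkpoint; the HMA method named in that summary presumably mixes model components in a more fine-grained way, which this uniform version does not attempt to reproduce.

```python
import torch

@torch.no_grad()
def average_checkpoints(sft_state, rlhf_state, alpha=0.5):
    """Uniform weight averaging: theta = (1 - alpha) * theta_sft + alpha * theta_rlhf.

    alpha = 0 keeps the pre-alignment (SFT) model, alpha = 1 keeps the RLHF model;
    intermediate values trade alignment gains against forgetting.
    """
    return {
        name: (1.0 - alpha) * sft_state[name] + alpha * rlhf_state[name]
        for name in sft_state
    }

# Hypothetical usage with two checkpoints of the same architecture:
# merged = average_checkpoints(sft_model.state_dict(), rlhf_model.state_dict(), alpha=0.7)
# sft_model.load_state_dict(merged)
```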