FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion
- URL: http://arxiv.org/abs/2503.04222v1
- Date: Thu, 06 Mar 2025 09:03:36 GMT
- Title: FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion
- Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Canbin Huang, Guosheng Liang, Xiaojun Quan
- Abstract summary: We introduce FuseChat-3.0, a suite of large language models (LLMs) developed by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. For target models, we focus on three widely used smaller variants: Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct. The resulting FuseChat-3.0 models exhibit significant performance gains across tasks such as instruction following, general knowledge, mathematics, and coding.
- Score: 32.0871035771324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce FuseChat-3.0, a suite of large language models (LLMs) developed by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. Our source models include the powerful Gemma-2-27B-it, Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct. For target models, we focus on three widely used smaller variants (Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct), along with two ultra-compact options, Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct. To leverage the diverse capabilities of these source models, we develop a specialized data construction protocol tailored to various tasks and domains. The FuseChat-3.0 training pipeline consists of two key stages: (1) supervised fine-tuning (SFT) to align the target and source model distributions, and (2) Direct Preference Optimization (DPO) to apply preferences from multiple source LLMs to fine-tune the target model. The resulting FuseChat-3.0 models exhibit significant performance gains across tasks such as instruction following, general knowledge, mathematics, and coding. As illustrated in Figure 1, using Llama-3.1-8B-Instruct as the target model, our fusion approach achieves an average improvement of 6.8 points across 14 benchmarks. Moreover, it demonstrates remarkable gains of 37.1 points and 30.1 points on the instruction-following benchmarks AlpacaEval-2 and Arena-Hard, respectively. Our code, models, and datasets are available at https://github.com/SLIT-AI/FuseChat-3.0.
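The abstract describes the pipeline only at a high level, so the Python sketch below is an illustrative reconstruction of stage (2), not the released implementation: it builds a preference pair from several source-model responses and scores the target model with the standard DPO objective. The helper names (`build_preference_pair`, `score_fn`) and the ranking signal are assumptions; the linked repository contains the authors' actual code.

```python
# Hedged sketch of the two-stage FuseChat-3.0 recipe described in the abstract.
# Stage 1 (SFT) would fine-tune the target model on preferred source responses;
# stage 2 applies DPO to preference pairs built from multiple source LLMs.
# build_preference_pair and score_fn are illustrative assumptions, not the
# authors' released code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy toward the chosen response."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def build_preference_pair(prompt: str, source_responses: dict, score_fn):
    """Rank responses from several source LLMs (e.g. Gemma-2-27B-it,
    Qwen-2.5-72B-Instruct) and keep the best and worst as a DPO pair.
    score_fn stands in for whatever quality signal the pipeline uses."""
    ranked = sorted(source_responses.items(), key=lambda kv: score_fn(prompt, kv[1]))
    (_, rejected), (_, chosen) = ranked[0], ranked[-1]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

In this reading, the SFT stage first aligns the target model with the top-ranked source responses, after which the chosen/rejected pairs above drive the DPO stage.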
Related papers
- Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer [5.585222292493927]
We propose Union-of-Experts (UoE), which decomposes the transformer into an equivalent group of experts and then implements selective routing on input data and experts. Experiments demonstrate that UoE models surpass full attention, state-of-the-art MoEs, and efficient transformers.
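As a generic illustration of selective routing (not the UoE decomposition itself), the sketch below routes each token to its top-k experts and mixes their outputs; all module names and sizes are placeholder assumptions.

```python
# Generic top-k expert routing sketch; illustrative only, not the UoE design.
import torch
import torch.nn as nn

class SelectiveExpertRouting(nn.Module):
    """Route each token to its top-k experts and combine their outputs."""
    def __init__(self, d_model: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights = self.gate(x).softmax(dim=-1)        # per-token expert scores
        topw, topi = weights.topk(self.k, dim=-1)     # keep only the top-k experts
        topw = topw / topw.sum(dim=-1, keepdim=True)  # renormalize kept scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += topw[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```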
arXiv Detail & Related papers (2025-03-04T11:01:25Z)
- The Best Instruction-Tuning Data are Those That Fit [17.401088816596054]
Supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs). We propose **GRAPE**, a novel SFT framework that accounts for the unique characteristics of the target model. For each instruction, it gathers responses from various LLMs and selects the one with the highest probability measured by the target model.
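A minimal sketch of this selection rule, assuming a Hugging Face-style causal LM and length-normalized log-probabilities; the function name and the normalization choice are illustrative, not the paper's exact implementation.

```python
# Pick, for one instruction, the candidate response the target model finds most
# probable. Assumes a Hugging Face-style causal LM; details are illustrative.
import torch

@torch.no_grad()
def select_by_target_likelihood(model, tokenizer, instruction, candidates):
    best, best_score = None, float("-inf")
    prompt_len = tokenizer(instruction, return_tensors="pt").input_ids.shape[1]
    for response in candidates:  # responses gathered from different source LLMs
        full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
        logits = model(full_ids).logits                      # (1, seq_len, vocab)
        logps = torch.log_softmax(logits[:, :-1], dim=-1)    # next-token log-probs
        token_logps = logps.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        resp_logps = token_logps[:, prompt_len - 1:]         # response tokens only
        score = (resp_logps.sum() / resp_logps.shape[1]).item()  # length-normalized
        if score > best_score:
            best, best_score = response, score
    return best
```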
arXiv Detail & Related papers (2025-02-06T16:31:21Z)
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization [65.64108848398696]
We introduce a preference optimization (PO) process to enhance the multimodal reasoning capabilities of MLLMs.
Specifically, we design an automated preference data construction pipeline to create MMPR, a high-quality, large-scale multimodal reasoning preference dataset.
We explore integrating PO with MLLMs, developing a simple yet effective method, termed Mixed Preference Optimization (MPO), which boosts multimodal CoT performance.
arXiv Detail & Related papers (2024-11-15T18:59:27Z)
- Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild [84.57103623507082]
This paper introduces Model-GLUE, a holistic scaling guideline for large language models.
We benchmark existing scaling techniques, especially selective merging and variants of mixture.
We then formulate an optimal strategy for selecting and aggregating a heterogeneous model zoo.
Our methodology involves clustering mergeable models, selecting optimal merging strategies, and integrating the clusters.
arXiv Detail & Related papers (2024-10-07T15:55:55Z)
- FuseChat: Knowledge Fusion of Chat Models [35.90957231731829]
We propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat.
We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales, including OpenChat-3.5-7B, Starling-LM-7B-alpha, NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and Qwen-1.5-Chat-72B.
arXiv Detail & Related papers (2024-08-15T07:37:24Z)
- DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling [24.270321913746233]
We propose a new model merging technique, Drop and rEscaLe via sampLing with mAgnitude (DELLA-Merging), that employs a novel pruning technique, MAGPRUNE.
MAGPRUNE first ranks the parameters in order of their magnitude and assigns higher dropout probabilities (p) to parameters with lower ranks corresponding to lower magnitudes.
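As a rough illustration of the idea summarized above, the snippet below ranks delta parameters by magnitude, samples a drop mask with higher drop probabilities for low-magnitude entries, and rescales the survivors; the linear probability schedule is an assumed detail, not necessarily DELLA's published one.

```python
# Magnitude-based drop-and-rescale sketch; the probability schedule is assumed.
import torch

def magnitude_based_drop(delta: torch.Tensor, p_min: float = 0.1, p_max: float = 0.9):
    flat = delta.flatten()
    ranks = flat.abs().argsort().argsort().float()   # rank 0 = smallest magnitude
    drop_prob = p_max - (p_max - p_min) * ranks / max(ranks.numel() - 1, 1)
    keep_prob = 1.0 - drop_prob                      # low magnitude -> low keep prob
    mask = torch.bernoulli(keep_prob)
    pruned = flat * mask / keep_prob.clamp_min(1e-8) # rescale the surviving entries
    return pruned.view_as(delta)
```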
arXiv Detail & Related papers (2024-06-17T15:02:45Z)
- Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks [0.9786690381850356]
This study presents an in-depth examination of 7 prominent large language models (LLMs) across 17 tasks using 22 datasets and 13.8 hours of speech in a zero-shot setting, comparing their performance against state-of-the-art (SOTA) models. Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, on several tasks.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- Knowledge Fusion of Chat LLMs: A Preliminary Technical Report [51.0178356903925]
We extend the FuseLLM framework to realize the fusion of chat LLMs, resulting in FuseChat.
We undertake knowledge fusion for source LLMs of varying structures and scales to derive multiple target LLMs of identical structure and size via lightweight fine-tuning.
We validate our approach using three prominent chat LLMs with diverse architectures and scales, namely NH2-Mixtral-8x7B, NH2-Solar-10.7B, and OpenChat-3.5-7B.
arXiv Detail & Related papers (2024-02-25T15:11:58Z)
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning [52.29522018586365]
We study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models.
Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains.
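The dynamic batch loading component can be illustrated with a small sketch: domains whose current loss most exceeds a reference loss receive a larger share of the next batch. The exponential weighting and the numbers below are assumptions for illustration, not the paper's exact update rule or data.

```python
# Illustrative dynamic batch loading: upweight domains whose loss lags its
# reference the most. The softmax-style update is an assumed detail.
import math

def update_domain_weights(current_loss, reference_loss, temperature=1.0):
    excess = {d: max(current_loss[d] - reference_loss[d], 0.0) for d in current_loss}
    z = sum(math.exp(v / temperature) for v in excess.values())
    return {d: math.exp(v / temperature) / z for d, v in excess.items()}

# Example: "web" is furthest above its reference loss, so it gets sampled more.
weights = update_domain_weights(
    current_loss={"web": 2.9, "code": 1.4, "books": 2.1},
    reference_loss={"web": 2.4, "code": 1.5, "books": 2.0},
)
```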
arXiv Detail & Related papers (2023-10-10T15:13:30Z)
- Llama 2: Open Foundation and Fine-Tuned Chat Models [65.43397761706336]
We develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs).
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases.
arXiv Detail & Related papers (2023-07-18T14:31:57Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder- and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, we explore how a mixture distribution and multi-epoch training of programming and natural languages affect model performance.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.