Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts
- URL: http://arxiv.org/abs/2406.11256v1
- Date: Mon, 17 Jun 2024 06:47:03 GMT
- Title: Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts
- Authors: Tong Zhu, Daize Dong, Xiaoye Qu, Jiacheng Ruan, Wenliang Chen, Yu Cheng
- Abstract summary: We propose a novel dynamic data mixture for MoE instruction tuning.
Inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets.
Results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge & reasoning tasks and open-ended queries.
- Score: 20.202031878825153
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Mixture-of-Experts (MoE) models have shown remarkable capability in instruction tuning, especially when the number of tasks scales. However, previous methods simply merge all training tasks (e.g. creative writing, coding, and mathematics) and apply fixed sampling weights, without considering the importance of different tasks as the model training state changes. In this way, the most helpful data cannot be effectively distinguished, leading to suboptimal model performance. To reduce the potential redundancies of datasets, we make the first attempt and propose a novel dynamic data mixture for MoE instruction tuning. Specifically, inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets. Finally, we propose to dynamically adjust the sampling weight of datasets by their inter-redundancies, thus maximizing global performance under a limited training budget. The experimental results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge & reasoning tasks and open-ended queries. Code and models are available at https://github.com/Spico197/MoE-SFT.
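
The idea in the abstract (routing-based dataset representations, pairwise differences, redundancy-aware sampling weights) can be illustrated with a minimal sketch. This is not the authors' implementation; their code is at the repository linked above. The function names, the L2 distance, the softmax target, and the `step_size` blending factor are all assumptions chosen for illustration. Each dataset is represented by how many of its tokens the router sent to each expert, and datasets whose routing pattern is far from the others (less redundant) receive more sampling weight.

```python
import numpy as np


def dataset_representations(routing_counts):
    """Normalize per-dataset expert token counts into routing distributions.

    routing_counts: array of shape (num_datasets, num_experts), where entry
    (i, e) is the number of tokens from dataset i routed to expert e.
    (Hypothetical input format, assumed for this sketch.)
    """
    counts = np.asarray(routing_counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def update_sampling_weights(weights, routing_counts, step_size=0.5):
    """Shift sampling weights toward datasets with distinctive routing.

    Assumed update rule (not taken from the paper): blend the current
    weights with a softmax over each dataset's mean L2 distance to the
    other datasets, then renormalize.
    """
    reps = dataset_representations(routing_counts)
    num_datasets = reps.shape[0]

    # Pairwise L2 distances between dataset-level representations.
    diffs = reps[:, None, :] - reps[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)

    # Mean distance to the *other* datasets; the self-distance is zero.
    distinctiveness = distances.sum(axis=1) / (num_datasets - 1)

    # Numerically stable softmax turns distinctiveness into a target mix.
    target = np.exp(distinctiveness - distinctiveness.max())
    target /= target.sum()

    current = np.asarray(weights, dtype=float)
    blended = (1.0 - step_size) * current + step_size * target
    return blended / blended.sum()


# Two datasets with identical routing and one that uses different experts.
counts = [[90, 10, 0, 0], [90, 10, 0, 0], [0, 0, 50, 50]]
new_weights = update_sampling_weights([1 / 3, 1 / 3, 1 / 3], counts)
```

In this toy setup the third dataset ends up with the largest weight, while the two datasets with identical routing share the same smaller weight. In a training loop, such an update would be recomputed periodically as the router's behavior changes.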
 
      
Related papers
        - LLM Data Selection and Utilization via Dynamic Bi-level Optimization [100.20933466418786]
 We propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve a dynamic data utilization during training.
Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data.
We further analyze how a model's data preferences evolve throughout training, providing new insights into the data preference of the model during training.
 arXiv  Detail & Related papers  (2025-07-22T02:47:12Z)
- Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning [49.10890099624699]
 We introduce a dynamic dataset pruning framework that adaptively selects training samples based on task-driven difficulty and cross-modality semantic consistency.
Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
 arXiv  Detail & Related papers  (2025-07-17T03:08:26Z)
- ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection [28.75333303894706]
 ToReMi is a novel framework that adjusts training sample weights according to their topical associations and observed learning patterns.
Our experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches.
 arXiv  Detail & Related papers  (2025-04-01T12:06:42Z)
- Mixture-of-Skills: Learning to Optimize Data Usage for Fine-Tuning Large Language Models [45.51085356985464]
 Large language models (LLMs) are typically fine-tuned on diverse and extensive datasets sourced from various origins.
MoS learns to optimize data usage automatically during the fine-tuning process.
MoSpec harnesses the utilities of various datasets for a specific purpose.
 arXiv  Detail & Related papers  (2024-06-13T05:01:28Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
 D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
 arXiv  Detail & Related papers  (2024-02-28T08:34:23Z)
- Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts [104.9871176044644]
 Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training.
We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE).
MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
 arXiv  Detail & Related papers  (2024-02-08T03:46:32Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
 We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
 arXiv  Detail & Related papers  (2024-02-06T19:18:04Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
 We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
 arXiv  Detail & Related papers  (2023-11-20T14:50:12Z)
- GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
 In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
 arXiv  Detail & Related papers  (2023-11-16T06:28:05Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
 We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
 arXiv  Detail & Related papers  (2023-11-14T14:10:40Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
 We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
 arXiv  Detail & Related papers  (2023-05-24T04:22:26Z)
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement [19.639936387834677]
 Mixture-of-Experts (MoEs) are becoming more popular and have demonstrated impressive pretraining scalability in various downstream tasks.
MoEs are becoming a new data analytics paradigm in the data life cycle and suffer from unique challenges at scales, complexities, and granularities never before possible.
In this paper, we propose a novel DNN training framework, FlexMoE, which systematically and transparently addresses the inefficiency caused by dynamic dataflow.
 arXiv  Detail & Related papers  (2023-04-08T07:34:26Z)
- Continual Learning with Optimal Transport based Mixture Model [17.398605698033656]
 We propose an online mixture model learning approach based on nice properties of the mature optimal transport theory (OT-MM).
Our proposed method can significantly outperform the current state-of-the-art baselines.
 arXiv  Detail & Related papers  (2022-11-30T06:40:29Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
 We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
 arXiv  Detail & Related papers  (2022-04-15T23:19:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     