Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM
- URL: http://arxiv.org/abs/2503.07680v1
- Date: Mon, 10 Mar 2025 10:52:50 GMT
- Title: Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM
- Authors: Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Yazhe Niu, Jiahao Hu, Ruihao Gong, Dahua Lin, Ningyi Xu
- Abstract summary: Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies.
- Score: 45.510445021130685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies. In particular, HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequential parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequential parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MoE model, our method speeds up training by 2.4$\times$ with competitive performance.
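Below is a minimal, hypothetical Python sketch of the batch-construction idea described in the abstract: each sample is routed to the packing group whose length budget fits it, and each group carries its own (assumed) sequence-parallelism degree and gradient-checkpointing setting. The two-level configuration, the first-fit packing heuristic, and all names are illustrative assumptions, not the paper's actual implementation.

```python
"""Minimal sketch of multi-level balanced packing (illustrative only)."""
from dataclasses import dataclass, field
from typing import List


@dataclass
class PackingGroup:
    max_len: int          # packing length for this group
    sp_degree: int        # sequence-parallel degree (hypothetical per-group setting)
    grad_ckpt: bool       # whether gradient checkpointing is enabled for this group
    packs: List[List[int]] = field(default_factory=list)  # packed sample lengths

    def pack(self, lengths: List[int]) -> None:
        """Greedy first-fit-decreasing packing into sequences of at most max_len tokens."""
        for n in sorted(lengths, reverse=True):
            for p in self.packs:
                if sum(p) + n <= self.max_len:
                    p.append(n)
                    break
            else:
                self.packs.append([n])


def hierarchical_balance_pack(sample_lengths: List[int],
                              groups: List[PackingGroup]) -> List[PackingGroup]:
    """Assign each sample to the smallest group whose packing length fits it,
    then pack each group independently. Short samples stay out of the
    long-context group, so its costlier settings are not applied to them."""
    groups = sorted(groups, key=lambda g: g.max_len)
    buckets = {id(g): [] for g in groups}
    for n in sample_lengths:
        target = next(g for g in groups if n <= g.max_len)
        buckets[id(target)].append(n)
    for g in groups:
        g.pack(buckets[id(g)])
    return groups


if __name__ == "__main__":
    # Hypothetical two-level setup: a short-context and a long-context group.
    groups = [
        PackingGroup(max_len=4096, sp_degree=1, grad_ckpt=False),
        PackingGroup(max_len=32768, sp_degree=4, grad_ckpt=True),
    ]
    lengths = [512, 3000, 1200, 28000, 800, 16000, 2048]
    for g in hierarchical_balance_pack(lengths, groups):
        print(f"max_len={g.max_len:>6}  sp={g.sp_degree}  packs={g.packs}")
```

The routing step is the point of the example: only the long-context group pays for higher sequence parallelism and checkpointing, which is the kind of per-group configuration the abstract attributes to HBP.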
Related papers
- Federated Continual Instruction Tuning [39.344583304181135]
Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training.
We introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge.
Our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting.
arXiv Detail & Related papers (2025-03-17T07:58:06Z) - Prior-Fitted Networks Scale to Larger Datasets When Treated as Weak Learners [82.72552644267724]
BoostPFN can outperform standard PFNs with the same size of training samples in large datasets.
High performance is maintained for up to 50x of the pre-training size of PFNs.
arXiv Detail & Related papers (2025-03-03T07:31:40Z) - Ensembles of Low-Rank Expert Adapters [9.599957499802446]
We propose the Ensembles of Low-Rank Expert Adapters (ELREA) framework to improve the model's capability to handle diverse tasks. ELREA clusters the training instructions based on their gradient directions, representing different areas of expertise. During inference, ELREA combines predictions from the most relevant expert adapters based on the input data's gradient similarity to the training clusters (an illustrative sketch of this routing idea appears after this list).
arXiv Detail & Related papers (2025-01-31T18:07:21Z) - Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data.
We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective.
We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x.
arXiv Detail & Related papers (2025-01-20T21:10:22Z) - Aligning Instruction Tuning with Pre-training [81.4748965653345]
We propose Aligning Instruction Tuning with Pre-training (AITP) to align instruction tuning with pre-training distributions. We show consistent performance improvements with AITP on three fully open large language models (LLMs) across eight benchmarks.
arXiv Detail & Related papers (2025-01-16T08:27:40Z) - Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences [31.232756326457277]
We develop Hydraulis, which jointly optimizes the parallel strategies and data assignment. Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.
arXiv Detail & Related papers (2024-12-10T20:01:53Z) - Efficient Bias Mitigation Without Privileged Information [14.21628601482357]
Deep neural networks trained via empirical risk minimisation often exhibit significant performance disparities across groups.
Existing bias mitigation methods that aim to address this issue often rely on group labels for training or validation.
We propose Targeted Augmentations for Bias Mitigation (TAB), a framework that leverages the entire training history of a helper model to identify spurious samples.
We show that TAB improves worst-group performance without any group information or model selection, outperforming existing methods while maintaining overall accuracy.
arXiv Detail & Related papers (2024-09-26T09:56:13Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance and is comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z) - Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
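The ELREA entry above describes clustering training data by gradient direction and, at inference, weighting expert adapters by the input's gradient similarity to those clusters. The sketch below illustrates that routing idea only, using plain k-means over normalized gradient features and a softmax-weighted ensemble; the clustering choice, feature extraction, and all names are assumptions for illustration, not the paper's method.

```python
"""Illustrative sketch of gradient-direction clustering and similarity-based
adapter routing (assumptions only, not ELREA's actual implementation)."""
import numpy as np


def cluster_by_gradient(grads: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Plain k-means (cosine similarity) on L2-normalized per-sample gradient vectors."""
    rng = np.random.default_rng(seed)
    g = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    centroids = g[rng.choice(len(g), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(g @ centroids.T, axis=1)  # nearest centroid by cosine similarity
        updated = []
        for c in range(k):
            members = g[assign == c]
            # keep the old centroid if a cluster happens to be empty
            updated.append(members.mean(axis=0) if len(members) else centroids[c])
        centroids = np.stack(updated)
        centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return assign, centroids


def route(input_grad: np.ndarray, centroids: np.ndarray, adapter_preds: np.ndarray) -> np.ndarray:
    """Blend per-expert predictions, weighting each expert by the input's
    cosine similarity to that expert's gradient cluster (softmax weights)."""
    sims = centroids @ (input_grad / np.linalg.norm(input_grad))
    weights = np.exp(sims - sims.max())
    weights = weights / weights.sum()
    return weights @ adapter_preds  # adapter_preds: one prediction row per expert
```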