Sparse Upcycling: Inference Inefficient Finetuning
- URL: http://arxiv.org/abs/2411.08968v1
- Date: Wed, 13 Nov 2024 19:02:36 GMT
- Title: Sparse Upcycling: Inference Inefficient Finetuning
- Authors: Sasha Doubov, Nikhil Sardana, Vitaliy Chiley
- Abstract summary: We show that sparse upcycling can achieve better quality, with improvements of over 20% relative to continued pretraining (CPT) in certain scenarios.
However, this comes with a significant inference cost, leading to 40% slowdowns in high-demand inference settings for larger models.
- Score: 4.988895645799531
- License:
- Abstract: Small, highly trained, open-source large language models are widely used due to their inference efficiency, but further improving their quality remains a challenge. Sparse upcycling is a promising approach that transforms a pretrained dense model into a Mixture-of-Experts (MoE) architecture, increasing the model's parameter count and quality. In this work, we compare the effectiveness of sparse upcycling against continued pretraining (CPT) across different model sizes, compute budgets, and pretraining durations. Our experiments show that sparse upcycling can achieve better quality, with improvements of over 20% relative to CPT in certain scenarios. However, this comes with a significant inference cost, leading to 40% slowdowns in high-demand inference settings for larger models. Our findings highlight the trade-off between model quality and inference efficiency, offering insights for practitioners seeking to balance model quality and deployment constraints.
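To make the upcycling step concrete, here is a minimal sketch assuming a plain PyTorch transformer feed-forward block; the class names (DenseFFN, UpcycledMoE) and hyperparameters (num_experts, top_k) are illustrative and not taken from the paper. It copies the pretrained dense FFN weights into each expert and adds a freshly initialized router, which is the general recipe behind sparse upcycling rather than the paper's exact configuration.

```python
# Illustrative sketch of sparse upcycling; names and hyperparameters are assumptions,
# not the paper's actual architecture or training setup.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseFFN(nn.Module):
    """Stand-in for the pretrained feed-forward block of a dense transformer layer."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class UpcycledMoE(nn.Module):
    """MoE layer whose experts all start as copies of the pretrained dense FFN."""

    def __init__(self, dense_ffn: DenseFFN, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_model = dense_ffn.up.in_features
        # Each expert is initialized from the dense FFN's weights (the "upcycling" step).
        self.experts = nn.ModuleList(
            [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
        )
        # The router is new and trained from scratch during finetuning.
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (tokens, d_model); route each token to its top-k experts.
        gate_logits = self.router(x)
        weights, indices = torch.topk(gate_logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    w = weights[mask][:, slot:slot + 1]
                    out[mask] += w * expert(x[mask])
        return out


# Example: upcycle a (pretend) pretrained dense FFN into an 8-expert MoE layer.
dense = DenseFFN(d_model=512, d_ff=2048)
moe = UpcycledMoE(dense, num_experts=8, top_k=2)
tokens = torch.randn(16, 512)
print(moe(tokens).shape)  # torch.Size([16, 512])
```

Starting every expert from the same pretrained weights lets them diverge during finetuning while preserving the dense model's quality; the added router and per-token expert dispatch are also a plausible source of the inference overhead the abstract reports.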
Related papers
- STLM Engineering Report: Dropout [4.3600359083731695]
We find that dropout remains effective in the overfitting scenario, and that it may have some relevance for improving the fit of models even in the case of excess data.
In the process we find that the existing explanation for the mechanism behind this performance gain is not applicable in the case of language modelling.
arXiv Detail & Related papers (2024-09-09T08:24:29Z)
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models [88.94539115180919]
Knowledge Distillation (KD) compresses expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models.
Most smaller models fail to surpass the performance of the original larger model, resulting in sacrificing performance to improve inference speed.
We propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models.
arXiv Detail & Related papers (2023-11-06T03:29:00Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models [87.7086269902562]
We show that subword-based models might still be the most practical choice in many settings.
We encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
arXiv Detail & Related papers (2022-10-13T15:47:09Z)
- Feeding What You Need by Understanding What You Learned [54.400455868448695]
Machine Reading Comprehension (MRC) reveals the ability to understand a given text passage and answer questions based on it.
Existing research works in MRC rely heavily on large-size models and corpus to improve the performance evaluated by metrics such as Exact Match.
We argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data.
arXiv Detail & Related papers (2022-03-05T14:15:59Z)
- Knowledge Distillation for Quality Estimation [79.51452598302934]
Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations.
Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results.
We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations while using 8x fewer parameters.
arXiv Detail & Related papers (2021-07-01T12:36:21Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.