Exploiting Student Parallelism for Efficient GPU Inference of BERT-like Models in Online Services
- URL: http://arxiv.org/abs/2408.12526v3
- Date: Tue, 18 Mar 2025 08:33:28 GMT
- Title: Exploiting Student Parallelism for Efficient GPU Inference of BERT-like Models in Online Services
- Authors: Weiyan Wang, Yilun Jin, Yiming Zhang, Victor Junqiu Wei, Han Tian, Li Chen, Jinbao Xue, Yangyu Tao, Di Wang, Kai Chen
- Abstract summary: We present \sys for the real-world setting of GPU inference on online workloads. \sys adopts stacking distillation and boosting ensemble, distilling the original deep model into a group of shallow but virtually stacked student models. Results show \sys outperforms the baselines by $4.1\times\sim 1.6\times$ in latency while maintaining accuracy and achieves up to $22.27\times$ higher throughput for workload bursts.
- Score: 27.998951498347626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to high accuracy, BERT-like models have been widely adopted by text mining and web searching. However, large BERT-like models suffer from inefficient online inference, facing the following two problems on GPUs: (1) their high accuracy relies on the large model depth, which linearly increases the sequential computation on GPUs; (2) stochastic and dynamic online workloads cause extra costs from batching and paddings. Therefore, we present \sys for the real-world setting of GPU inference on online workloads. At its core, \sys adopts stacking distillation and boosting ensemble, distilling the original deep model into a group of shallow but virtually stacked student models running in parallel. This enables \sys to achieve a lower model depth (e.g., two layers) than the others and the lowest inference latency while maintaining accuracy. In addition, adaptive student pruning realizes dynamic student numbers according to changing online workloads. Especially for occasional workload bursts, it can temporarily decrease the student number with minimal accuracy loss to improve system throughput. We conduct comprehensive experiments to verify the effectiveness, whose results show that \sys outperforms the baselines by $4.1\times\sim 1.6\times$ in latency while maintaining accuracy and achieves up to $22.27\times$ higher throughput for workload bursts.
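As a rough illustration of the idea in the abstract, the PyTorch sketch below runs a small group of two-layer student encoders on the same input and sums their logits as an additive (boosting-style) ensemble, with an `active` argument standing in for adaptive student pruning. This is not the paper's implementation; names such as `ShallowStudent`, `BoostedEnsemble`, and `num_students` are invented for illustration, and true student parallelism would need separate CUDA streams or fused/grouped kernels rather than the sequential loop shown here.

```python
import torch
import torch.nn as nn

class ShallowStudent(nn.Module):
    """A two-layer Transformer encoder acting as one boosted student."""
    def __init__(self, hidden=768, heads=12, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                        # x: (batch, seq_len, hidden)
        return self.head(self.encoder(x)[:, 0])  # classify from the first token

class BoostedEnsemble(nn.Module):
    """Sums the logits of all active students (additive boosting ensemble)."""
    def __init__(self, num_students=4, **kw):
        super().__init__()
        self.students = nn.ModuleList(ShallowStudent(**kw) for _ in range(num_students))

    def forward(self, x, active=None):
        # `active` mimics adaptive student pruning: under a workload burst,
        # fewer students run, trading a little accuracy for throughput.
        kept = list(self.students)[: active or len(self.students)]
        # NOTE: a plain loop evaluates students sequentially; the paper's point
        # is to run them in parallel (e.g. separate streams or grouped GEMMs).
        return torch.stack([s(x) for s in kept]).sum(dim=0)

if __name__ == "__main__":
    model = BoostedEnsemble(num_students=4)
    embeddings = torch.randn(8, 128, 768)      # (batch, seq_len, hidden)
    print(model(embeddings, active=2).shape)   # torch.Size([8, 2]) with 2 students
```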
Related papers
- Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation [10.48108719012248]
We focus on Knowledge Distillation (KD), where a compact student model is trained to mimic a larger teacher model.
In contrast to much of the previous work, we scale up the parameters of the student model during training.
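One hedged way to read "scaling up the student's parameters during training" is a factorized layer whose factors hold more parameters than the plain weight matrix and are contracted back into a single weight for inference. The sketch below uses a simple matrix factorization rather than the paper's tensor decomposition; the names `FactorizedLinear` and `rank` are illustrative.

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    """Training-time over-parameterization of a linear layer: W = U @ S @ V.
    With rank > d the factors contain more trainable parameters than the plain
    d_out x d_in matrix; at inference they are merged into a single weight."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.02)
        self.S = nn.Parameter(torch.randn(rank, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, d_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def merged_weight(self):
        return self.U @ self.S @ self.V          # (d_out, d_in)

    def forward(self, x):
        return x @ self.merged_weight().t() + self.bias

layer = FactorizedLinear(d_in=768, d_out=768, rank=1024)  # rank > d => over-parameterized
x = torch.randn(4, 768)
print(layer(x).shape)  # torch.Size([4, 768])
```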
arXiv Detail & Related papers (2024-11-10T12:40:59Z) - ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving [15.01982917560918]
This paper proposes to harvest stranded GPU resources for offline LLM inference tasks.
We built ConServe, an LLM serving system that contains an execution engine that preempts running offline tasks.
Our evaluation demonstrates that ConServe achieves strong performance isolation when co-serving online and offline tasks.
arXiv Detail & Related papers (2024-10-02T04:12:13Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
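For context, the sketch below shows the standard logit-distillation objective (softened KL plus task cross-entropy) that methods like OKD build on; OKD's online teacher modules are not modeled here, and the hyperparameters `T` and `alpha` are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic logit distillation: KL to the softened teacher plus task CE."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels).item())
```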
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Harnessing Manycore Processors with Distributed Memory for Accelerated Training of Sparse and Recurrent Models [43.1773057439246]
Current AI training infrastructure is dominated by single instruction multiple data (SIMD) and systolic array architectures.
We explore sparse and recurrent model training on a massively parallel multiple instruction multiple data architecture with distributed local memory.
arXiv Detail & Related papers (2023-11-07T23:18:35Z) - Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms: spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
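A minimal, hypothetical example of one of these paradigms, dynamic layer skipping: a tiny gate decides whether a residual block runs at all. This is only a sketch of the concept, not LAUDNet's design; turning skipped work into actual GPU latency savings is exactly the scheduling problem such frameworks address.

```python
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    """A residual MLP block guarded by a small gate (dynamic layer skipping)."""
    def __init__(self, dim=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(dim, 1)

    def forward(self, x):
        # One hard decision for the whole batch keeps the sketch simple and lets
        # the skipped branch avoid any computation at all; real frameworks decide
        # per sample and use a soft relaxation during training.
        if torch.sigmoid(self.gate(x.mean(dim=0))).item() < 0.5:
            return x                      # skip: identity, no block latency
        return x + self.body(x)           # execute: standard residual update

block = SkippableBlock(dim=256)
x = torch.randn(32, 256)
print(block(x).shape)  # torch.Size([32, 256])
```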
arXiv Detail & Related papers (2023-08-30T10:57:41Z) - Isolation and Induction: Training Robust Deep Neural Networks against Model Stealing Attacks [51.51023951695014]
Existing model stealing defenses add deceptive perturbations to the victim's posterior probabilities to mislead the attackers.
This paper proposes Isolation and Induction (InI), a novel and effective training framework for model stealing defenses.
In contrast to adding perturbations over model predictions that harm the benign accuracy, we train models to produce uninformative outputs against stealing queries.
arXiv Detail & Related papers (2023-08-02T05:54:01Z) - PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs [3.3019914257038168]
Dynamic Graph Neural Networks (DGNNs) have been broadly applied in various real-life applications, such as link prediction and pandemic forecast.
DGNNs manifest substantial parallel computation and data reuse potentials, but suffer from severe memory access inefficiency and data transfer overhead.
We propose PiPAD, a Pipelined and Parallel DGNN training framework for end-to-end performance optimization on GPUs.
arXiv Detail & Related papers (2023-01-01T12:10:31Z) - Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments.
arXiv Detail & Related papers (2022-11-21T03:09:42Z) - Slimmable Networks for Contrastive Self-supervised Learning [69.9454691873866]
Self-supervised learning has made significant progress in pre-training large models, but struggles with small models.
We introduce another one-stage solution to obtain pre-trained small models without the need for extra teachers.
A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain various networks.
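A minimal sketch of the weight-sharing idea behind slimmable networks (names such as `SlimmableLinear` and `width_mult` are illustrative, not the paper's API): a sub-network of a given width simply uses the leading slice of the full layer's parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableLinear(nn.Module):
    """One weight matrix shared by several widths: a narrower sub-network just
    slices the leading rows/columns of the full layer's parameters."""
    def __init__(self, in_full=512, out_full=512):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_full, in_full) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_full))

    def forward(self, x, width_mult=1.0):
        out = int(self.weight.shape[0] * width_mult)
        inp = x.shape[-1]                 # input already slimmed by the previous layer
        return F.linear(x, self.weight[:out, :inp], self.bias[:out])

layer = SlimmableLinear(512, 512)
full = layer(torch.randn(4, 512), width_mult=1.0)   # (4, 512)
half = layer(torch.randn(4, 256), width_mult=0.5)   # (4, 256), same shared weights
print(full.shape, half.shape)
```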
arXiv Detail & Related papers (2022-09-30T15:15:05Z) - Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
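For illustration, a generic top-1-routed mixture-of-experts feed-forward layer is sketched below; it conveys why MoE raises capacity without raising per-token compute, but it is not MoEBERT's importance-guided adaptation, and all names (`MoEFeedForward`, `num_experts`) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Top-1 routed mixture of expert FFNs: capacity grows with the number of
    experts while each token only pays for one expert's computation."""
    def __init__(self, d_model=768, d_ff=3072, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        top_prob, top_idx = scores.max(dim=-1)  # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                out[mask] = top_prob[mask, None] * expert(x[mask])
        return out

moe = MoEFeedForward()
tokens = torch.randn(10, 768)
print(moe(tokens).shape)  # torch.Size([10, 768])
```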
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Hydra: A System for Large Multi-Model Deep Learning [3.571623412954477]
We present 'model spilling', a technique aimed at models such as Transformers and CNNs that moves groups of layers between DRAM and GPU memory.
We then present a set of novel techniques leveraging spilling to raise efficiency for multi-model training workloads.
Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
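A toy, sequential sketch of model spilling, under the stated assumption that layers are grouped ahead of time: only the group currently executing resides on the GPU. Hydra additionally overlaps these transfers with compute and schedules multiple models, which this sketch does not attempt; the function name `spill_forward` is invented for illustration.

```python
import torch
import torch.nn as nn

def spill_forward(layer_groups, x, device="cuda"):
    """Run a deep model one layer group at a time, spilling each group back to
    DRAM after use so only one group occupies GPU memory at once."""
    x = x.to(device)
    for group in layer_groups:
        group.to(device)            # fetch the group into GPU memory
        x = group(x)
        group.to("cpu")             # spill it back to DRAM to free GPU memory
    return x

# Example: a deep MLP split into four groups of eight layers each.
groups = [nn.Sequential(*(nn.Linear(1024, 1024) for _ in range(8))) for _ in range(4)]
if torch.cuda.is_available():
    out = spill_forward(groups, torch.randn(16, 1024))
    print(out.shape)  # torch.Size([16, 1024])
```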
arXiv Detail & Related papers (2021-10-16T18:13:57Z) - Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
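For reference, a deliberately naive uniform neighborhood sampler is sketched below to show what is being accelerated; the function name and `fanout` parameter are illustrative, and a performance-engineered sampler would process many seeds in bulk rather than looping in Python.

```python
import torch

def sample_neighborhood(edge_index, seed_nodes, fanout=10):
    """Uniformly sample up to `fanout` in-neighbors for each seed node."""
    src, dst = edge_index            # edge_index: (2, num_edges), edges src -> dst
    sampled_src, sampled_dst = [], []
    for node in seed_nodes.tolist():
        neighbors = src[dst == node]
        if neighbors.numel() > fanout:
            neighbors = neighbors[torch.randperm(neighbors.numel())[:fanout]]
        sampled_src.append(neighbors)
        sampled_dst.append(torch.full_like(neighbors, node))
    return torch.cat(sampled_src), torch.cat(sampled_dst)

# Toy graph with 100 nodes and 1000 random directed edges.
edges = torch.randint(0, 100, (2, 1000))
seeds = torch.arange(0, 10)
src, dst = sample_neighborhood(edges, seeds, fanout=5)
print(src.shape, dst.shape)
```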
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Self-Damaging Contrastive Learning [92.34124578823977]
In reality, unlabeled data is commonly imbalanced and follows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning (SDCLR) to automatically balance representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z) - ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [23.264897780201316]
Various deep Click-Through Rate (CTR) models are deployed in commercial systems by industrial companies.
To achieve better performance, it is necessary to train deep CTR models on huge volumes of training data efficiently.
We propose ScaleFreeCTR, a MixCache-based distributed training system for CTR models.
arXiv Detail & Related papers (2021-04-17T13:36:19Z) - Accelerating Sparse Deep Neural Networks [20.6942347219753]
We present the design and behavior of Sparse Tensor Cores, which exploit a 2:4 (50%) sparsity pattern that leads to twice the math throughput of dense matrix units.
We also describe a simple workflow for training networks that both satisfy the 2:4 sparsity pattern requirements and maintain accuracy.
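A small illustration of the 2:4 pattern itself (not NVIDIA's training workflow): for every four consecutive weights, keep the two largest in magnitude and zero the rest, which yields exactly 50% sparsity. The helper name `prune_2_of_4` is an assumption for this sketch.

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude values in every group of 4 consecutive
    weights along the last dimension (the 2:4 structured-sparsity pattern).
    Real workflows prune once and then fine-tune to recover accuracy."""
    out_dim, in_dim = weight.shape
    assert in_dim % 4 == 0, "last dimension must be a multiple of 4"
    groups = weight.reshape(out_dim, in_dim // 4, 4)
    keep = groups.abs().topk(k=2, dim=-1).indices       # two largest |w| per group
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_dim, in_dim)

w = torch.randn(8, 16)
sparse_w = prune_2_of_4(w)
print((sparse_w == 0).float().mean().item())  # 0.5 -> exactly half the weights are zero
```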
arXiv Detail & Related papers (2021-04-16T21:27:32Z) - DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling [24.07558669713062]
We propose DiPair -- a framework for distilling fast and accurate models on text pair tasks.
It is both highly scalable and offers improved quality-speed tradeoffs.
Empirical studies conducted on both academic and real-world e-commerce benchmarks demonstrate the efficacy of the proposed approach.
arXiv Detail & Related papers (2020-10-07T01:19:23Z) - LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning [146.51221523793342]
LightPAFF uses two-stage knowledge distillation to transfer knowledge from a big teacher model to a lightweight student model.
LightPAFF reduces the model size by nearly 5x and improves online inference speed by 5x-7x.
arXiv Detail & Related papers (2020-04-27T14:00:09Z) - Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least 6.5$\times$ speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.