Sample-based Dynamic Hierarchical Transformer with Layer and Head
Flexibility via Contextual Bandit
- URL: http://arxiv.org/abs/2312.03038v3
- Date: Wed, 10 Jan 2024 06:08:41 GMT
- Title: Sample-based Dynamic Hierarchical Transformer with Layer and Head
Flexibility via Contextual Bandit
- Authors: Fanfei Meng, Lele Zhang, Yu Chen, Yuxin Wang
- Abstract summary: Transformers require a fixed number of layers and heads, which makes them inflexible to the complexity of individual samples.
We propose a sample-based Dynamic Hierarchical Transformer model whose layers and heads can be dynamically configured per data sample.
We achieve up to 74% computational savings for both training and inference with a minimal loss of accuracy.
- Score: 24.78757412559944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers require a fixed number of layers and heads, which makes them inflexible to the complexity of individual samples and expensive in training and inference. To address this, we propose a sample-based Dynamic Hierarchical Transformer (DHT) model whose layers and heads can be dynamically configured per data sample by solving contextual bandit problems. To determine the number of layers and heads, we use the Uniform Confidence Bound, while we deploy combinatorial Thompson Sampling to select specific head combinations given their number. Unlike previous work that focuses on compressing trained networks for inference only, DHT not only adaptively optimizes the underlying network architecture during training but also yields a flexible network for efficient inference. To the best of our knowledge, this is the first comprehensive data-driven dynamic transformer that does not require any auxiliary neural networks to implement the dynamic system. According to the experimental results, we achieve up to 74% computational savings for both training and inference with minimal loss of accuracy.
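The bandit machinery described in the abstract can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the candidate depths, the reward signal (a placeholder standing in for something like accuracy gained per unit of compute), and the class names UCBArmSelector and ThompsonHeadSelector are illustrative assumptions, and a standard UCB1 rule plus per-head Bernoulli Thompson Sampling stand in for the paper's Uniform Confidence Bound and combinatorial Thompson Sampling procedures.

```python
# Hedged sketch: a UCB rule picks how many layers/heads to use for a sample,
# and per-head Thompson Sampling picks which heads to activate given that
# budget. Reward definition and hyperparameters are assumptions for
# illustration only; they are not specified in the abstract.
import math
import random


class UCBArmSelector:
    """UCB1 over a discrete set of arms (e.g. candidate layer counts)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}
        self.total = 0

    def select(self):
        self.total += 1
        for a in self.arms:  # play every arm once before using the bound
            if self.counts[a] == 0:
                return a
        return max(
            self.arms,
            key=lambda a: self.values[a]
            + math.sqrt(2.0 * math.log(self.total) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


class ThompsonHeadSelector:
    """Bernoulli Thompson Sampling over heads; keep the k highest samples."""

    def __init__(self, num_heads):
        self.alpha = [1.0] * num_heads  # Beta(1, 1) prior per head
        self.beta = [1.0] * num_heads

    def select(self, k):
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return sorted(range(len(samples)), key=lambda h: samples[h], reverse=True)[:k]

    def update(self, heads, reward):
        for h in heads:  # treat a reward in [0, 1] as a soft success count
            self.alpha[h] += reward
            self.beta[h] += 1.0 - reward


# Usage sketch: per sample, pick a depth and a head subset, run the (omitted)
# truncated transformer, and feed a bounded reward back to both selectors.
layer_bandit = UCBArmSelector(arms=[2, 4, 6, 8])
head_bandit = ThompsonHeadSelector(num_heads=8)

for _ in range(100):
    depth = layer_bandit.select()
    heads = head_bandit.select(k=4)
    reward = random.random()  # placeholder for accuracy-per-compute feedback
    layer_bandit.update(depth, reward)
    head_bandit.update(heads, reward)
```

In this sketch the same scalar reward drives both selectors, so cheaper configurations that preserve accuracy are gradually favored; the paper's actual context features and reward shaping are not given in the abstract and would replace the placeholders above.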
Related papers
- Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot [50.16171384920963]
The transformer architecture has prevailed in various deep learning settings.
A one-layer transformer trained with gradient descent provably learns the sparse token selection task.
arXiv Detail & Related papers (2024-06-11T02:15:53Z) - Towards Optimal Customized Architecture for Heterogeneous Federated
Learning with Contrastive Cloud-Edge Model Decoupling [20.593232086762665]
Federated learning, as a promising distributed learning paradigm, enables collaborative training of a global model across multiple network edge clients without the need for central data collection.
We propose a novel federated learning framework called FedCMD, with model decoupling tailored to cloud-edge supported federated learning.
Our motivation is that, through a deep investigation of the performance of selecting different neural network layers as the personalized head, we found that rigidly assigning the last layer as the personalized head, as in current studies, is not always optimal.
arXiv Detail & Related papers (2024-03-04T05:10:28Z) - Dynamic Layer Tying for Parameter-Efficient Transformers [65.268245109828]
We employ Reinforcement Learning to select layers during training and tie them together.
This facilitates weight sharing, reduces the number of trainable parameters, and also serves as an effective regularization technique.
In particular, memory consumption during training is up to an order of magnitude lower than with conventional training (a minimal sketch of this layer-tying idea follows the list below).
arXiv Detail & Related papers (2024-01-23T14:53:20Z) - Hierarchical Over-the-Air FedGradNorm [50.756991828015316]
Multi-task learning (MTL) is a learning paradigm to learn multiple related tasks simultaneously with a single shared network.
We propose hierarchical over-the-air (HOTA) PFL with a dynamic weighting strategy which we call HOTA-FedGradNorm.
arXiv Detail & Related papers (2022-12-14T18:54:46Z) - Predictive Coding beyond Gaussian Distributions [38.51699576854394]
Predictive coding (PC) is a neuroscience-inspired method that performs inference on hierarchical Gaussian generative models.
These methods fail to keep up with modern neural networks, as they are unable to replicate the dynamics of complex layers and activation functions.
We show that our method allows us to train transformer networks and achieve performance comparable with backpropagation (BP) on conditional language models.
arXiv Detail & Related papers (2022-11-07T12:02:05Z) - Arbitrary Bit-width Network: A Joint Layer-Wise Quantization and
Adaptive Inference Approach [38.03309300383544]
We propose to feed different data samples with varying quantization schemes to achieve a data-dependent dynamic inference, at a fine-grained layer level.
We present the Arbitrary Bit-width Network (ABN), where the bit-widths of a single deep network can change at runtime for different data samples, with a layer-wise granularity.
On ImageNet classification, we achieve a 1.1% top-1 accuracy improvement while saving 36.2% BitOps.
arXiv Detail & Related papers (2022-04-21T09:36:43Z) - HyperTransformer: Model Generation for Supervised and Semi-Supervised
Few-Shot Learning [14.412066456583917]
We propose a transformer-based model for few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples.
Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal.
We extend our approach to a semi-supervised regime utilizing unlabeled samples in the support set and further improving few-shot performance.
arXiv Detail & Related papers (2022-01-11T20:15:35Z) - Model Fusion of Heterogeneous Neural Networks via Cross-Layer Alignment [17.735593218773758]
We propose a novel model fusion framework, named CLAFusion, to fuse neural networks with different numbers of layers.
Based on the cross-layer alignment, our framework balances the number of layers of neural networks before applying layer-wise model fusion.
arXiv Detail & Related papers (2021-10-29T05:02:23Z) - Shape Adaptor: A Learnable Resizing Module [59.940372879848624]
We present a novel resizing module for neural networks: shape adaptor, a drop-in enhancement built on top of traditional resizing layers.
Our implementation enables shape adaptors to be trained end-to-end without any additional supervision.
We show the effectiveness of shape adaptors on two other applications: network compression and transfer learning.
arXiv Detail & Related papers (2020-08-03T14:15:52Z) - Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each of the evaluated tasks, across four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z) - Fitting the Search Space of Weight-sharing NAS with Graph Convolutional
Networks [100.14670789581811]
We train a graph convolutional network to fit the performance of sampled sub-networks.
With this strategy, we achieve a higher rank correlation coefficient in the selected set of candidates.
arXiv Detail & Related papers (2020-04-17T19:12:39Z)
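As referenced in the Dynamic Layer Tying entry above, the following is a hedged sketch of the layer-tying idea: each layer either keeps its own weights or is tied to (reuses the module of) an earlier layer, so tied layers share parameters. The random tying decisions, the TiedTransformerEncoder class, and all dimensions are placeholder assumptions; the excerpt does not describe the paper's actual reinforcement-learning policy for choosing ties.

```python
# Hedged sketch of dynamic layer tying via module reuse. Tying decisions are
# random placeholders standing in for the learned RL policy in the paper.
import random
import torch
import torch.nn as nn


class TiedTransformerEncoder(nn.Module):
    def __init__(self, num_layers, d_model=64, nhead=4):
        super().__init__()
        self.blocks = nn.ModuleList()  # unique (untied) layer modules
        self.layer_map = []            # layer position -> index into self.blocks
        for i in range(num_layers):
            # Placeholder decision: tie to a random earlier block half the time.
            if i > 0 and random.random() < 0.5:
                self.layer_map.append(random.randrange(len(self.blocks)))
            else:
                self.blocks.append(
                    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
                )
                self.layer_map.append(len(self.blocks) - 1)

    def forward(self, x):
        # Tied positions reuse the same module, so their weights are shared.
        for block_idx in self.layer_map:
            x = self.blocks[block_idx](x)
        return x


model = TiedTransformerEncoder(num_layers=8)
out = model(torch.randn(2, 16, 64))  # (batch, seq_len, d_model)
print(out.shape, sum(p.numel() for p in model.parameters()))
```

Because tied positions reference the same nn.Module instance, their parameters are stored once, which is the mechanism behind the reduced trainable-parameter count and training memory noted in that excerpt.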
This list is automatically generated from the titles and abstracts of the papers on this site.