BERMo: What can BERT learn from ELMo?
- URL: http://arxiv.org/abs/2110.15802v1
- Date: Mon, 18 Oct 2021 17:35:41 GMT
- Title: BERMo: What can BERT learn from ELMo?
- Authors: Sangamesh Kodge and Kaushik Roy
- Abstract summary: We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths.
Our approach has two-fold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
- Score: 6.417011237981518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose BERMo, an architectural modification to BERT, which makes
predictions based on a hierarchy of surface, syntactic and semantic language
features. We use the linear combination scheme proposed in Embeddings from Language
Models (ELMo) to combine the scaled internal representations from different
network depths. Our approach has two-fold benefits: (1) improved gradient flow
for the downstream task as every layer has a direct connection to the gradients
of the loss function and (2) increased representative power as the model no
longer needs to copy the features learned in the shallower layers that are
necessary for the downstream task. Further, our model has a negligible
parameter overhead as there is a single scalar parameter associated with each
layer in the network. Experiments on the probing task from the SentEval dataset
show that our model performs up to $4.65\%$ better in accuracy than the
baseline with an average improvement of $2.67\%$ on the semantic tasks. When
subject to compression techniques, we find that our model enables stable
pruning for compressing small datasets like SST-2, where the BERT model
commonly diverges. We observe that our approach converges $1.67\times$ and
$1.15\times$ faster than the baseline on the MNLI and QQP tasks from the GLUE dataset.
Moreover, our results show that our approach can obtain better parameter
efficiency for penalty-based pruning approaches on the QQP task.
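As an illustration of the combination scheme described above, here is a minimal PyTorch sketch (not the authors' released code): one learnable scalar per encoder layer plus a global scale, applied to the stacked hidden states. The softmax normalization of the scalars and the Hugging Face-style usage in the comments are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style mix: a softmax-normalized scalar weight per layer plus a
    global scale gamma, applied to the per-layer hidden states."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # one scalar per layer
        self.gamma = nn.Parameter(torch.ones(()))                   # global scaling factor

    def forward(self, hidden_states):
        # hidden_states: list/tuple of num_layers tensors, each (batch, seq_len, hidden)
        weights = torch.softmax(self.layer_weights, dim=0)        # overhead stays at num_layers + 1 scalars
        stacked = torch.stack(tuple(hidden_states), dim=0)        # (num_layers, batch, seq_len, hidden)
        mixed = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # every layer gets a direct path to the loss
        return self.gamma * mixed

# Illustrative usage with a BERT encoder that returns all hidden states
# (the model name and transformers API usage are assumptions):
# from transformers import AutoModel, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# bert = AutoModel.from_pretrained("bert-base-uncased")
# out = bert(**tok("hello world", return_tensors="pt"), output_hidden_states=True)
# mix = ScalarMix(num_layers=len(out.hidden_states))
# features = mix(out.hidden_states)  # (batch, seq_len, hidden), fed to the downstream head
```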
Related papers
- Representation Similarity: A Better Guidance of DNN Layer Sharing for Edge Computing without Training [3.792729116385123]
We propose a new model merging scheme by sharing representations at the edge, guided by representation similarity S.
We show that S is highly correlated with the merged model's accuracy, with a Pearson correlation coefficient of |r| > 0.94, higher than for other metrics.
arXiv Detail & Related papers (2024-10-15T03:35:54Z) - Layer-wise Model Merging for Unsupervised Domain Adaptation in Segmentation Tasks [3.776249047528669]
We leverage the abundance of freely available trained models to introduce a cost-free approach to model merging.
It aims to maintain the distinctiveness of the task-specific final layers while unifying the initial layers.
This approach ensures parameter consistency across all layers, essential for boosting performance.
arXiv Detail & Related papers (2024-09-24T07:19:30Z) - Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers.
arXiv Detail & Related papers (2023-07-13T09:39:10Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Parameter-Efficient Abstractive Question Answering over Tables or Text [60.86457030988444]
A long-term ambition of information seeking QA systems is to reason over multi-modal contexts and generate natural answers to user queries.
Memory-intensive pre-trained language models are adapted to downstream tasks such as QA by fine-tuning the model on QA data in a specific modality, like unstructured text or structured tables.
To avoid training such memory-hungry models while utilizing a uniform architecture for each modality, parameter-efficient adapters add and train small task-specific bottleneck layers between transformer layers.
arXiv Detail & Related papers (2022-04-07T10:56:29Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% of the FLOPs for DeiT-B while simultaneously obtaining an impressive 0.6% top-1 accuracy gain.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - Learning to Generate Content-Aware Dynamic Detectors [62.74209921174237]
We introduce a new perspective on designing efficient detectors: automatically generating sample-adaptive model architectures.
We introduce a coarse-to-fine strategy tailored for object detection to guide the learning of dynamic routing.
Experiments on the MS-COCO dataset demonstrate that CADDet achieves 1.8 higher mAP with 10% fewer FLOPs compared with vanilla routing.
arXiv Detail & Related papers (2020-12-08T08:05:20Z) - Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each of the evaluated tasks, across four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z) - Training with Multi-Layer Embeddings for Model Reduction [0.9046327456472286]
We introduce a multi-layer embedding training architecture that trains embeddings via a sequence of linear layers.
We show that it allows reducing the embedding dimension d by 4-8X, with a corresponding improvement in memory footprint, at a given model accuracy (see the sketch below).
arXiv Detail & Related papers (2020-06-10T02:47:40Z)