Ensemble Transformer for Efficient and Accurate Ranking Tasks: an
Application to Question Answering Systems
- URL: http://arxiv.org/abs/2201.05767v1
- Date: Sat, 15 Jan 2022 06:21:01 GMT
- Title: Ensemble Transformer for Efficient and Accurate Ranking Tasks: an
Application to Question Answering Systems
- Authors: Yoshitomo Matsubara, Luca Soldaini, Eric Lind, Alessandro Moschitti
- Abstract summary: We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
An MHS model consists of two components: a stack of transformer layers that is used to encode inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages individual models in the ensemble as teachers in a way that preserves the diversity of the ensemble members.
- Score: 99.13795374152997
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large transformer models can substantially improve the Answer
Sentence Selection (AS2) task, but their high computational costs prevent
their use in many real-world applications. In this paper, we explore the
following research question: How can we make AS2 models more accurate
without significantly increasing their
model complexity? To address the question, we propose a Multiple Heads Student
architecture (MHS), an efficient neural network designed to distill an ensemble
of large transformers into a single smaller model. An MHS model consists of two
components: a stack of transformer layers that is used to encode inputs, and a
set of ranking heads; each of them is trained by distilling a different large
transformer architecture. Unlike traditional distillation techniques, our
approach leverages individual models in the ensemble as teachers in a way that
preserves the diversity of the ensemble members. The resulting model captures
the knowledge of different types of transformer models by using just a few
extra parameters. We show the effectiveness of MHS on three English datasets
for AS2; our proposed approach outperforms all single-model distillations we
consider, rivaling the state-of-the-art large AS2 models that have 2.7x more
parameters and run 2.5x slower.
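As a rough illustration of the architecture described above, the following PyTorch-style sketch pairs a shared transformer encoder with one small ranking head per teacher; the class name, head design, MSE distillation loss, and score averaging are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of a Multiple Heads Student (MHS) ranker, assuming a shared
# transformer encoder and one tiny ranking head per teacher. Hypothetical names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadStudent(nn.Module):
    def __init__(self, hidden_dim=768, num_layers=4, num_teachers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # One small ranking head per teacher: only a few extra parameters each.
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(num_teachers)])

    def forward(self, embeddings):  # embeddings: (batch, seq_len, hidden_dim), pre-embedded toy inputs
        pooled = self.encoder(embeddings)[:, 0]                       # first-token representation
        return torch.cat([h(pooled) for h in self.heads], dim=-1)     # (batch, num_teachers)

def distillation_loss(student_scores, teacher_scores):
    # Head i regresses the score of teacher i, preserving teacher diversity.
    return F.mse_loss(student_scores, teacher_scores)

model = MultiHeadStudent()
x = torch.randn(2, 16, 768)              # toy question-answer pair representations
teacher_scores = torch.randn(2, 3)       # scores produced by 3 different large teachers
scores = model(x)
loss = distillation_loss(scores, teacher_scores)
final_score = scores.mean(dim=-1)        # ensemble-style aggregation at inference
```

Because each head is a single linear layer on top of the shared encoder, the per-teacher capacity comes at the cost of only a few extra parameters, in line with the abstract's claim.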
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- Multi-Path Transformer is Better: A Case Study on Neural Machine Translation [35.67070351304121]
We study how model width affects the Transformer model through a parameter-efficient multi-path structure.
Experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model.
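The exact multi-path structure is not detailed in this summary; as a generic illustration of trading depth for width, the sketch below averages several narrow feed-forward paths at a roughly constant parameter budget (all names are hypothetical, not the paper's architecture).

```python
# Generic sketch of a multi-path feed-forward block: several narrow parallel
# paths whose outputs are averaged, using roughly the parameters of one wide FFN.
import torch
import torch.nn as nn

class MultiPathFFN(nn.Module):
    def __init__(self, d_model=512, d_hidden=512, num_paths=4):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_paths)
        ])

    def forward(self, x):
        # Average the parallel paths; a single FFN with d_hidden * num_paths units
        # would use roughly the same number of parameters.
        return torch.stack([p(x) for p in self.paths]).mean(dim=0)

block = MultiPathFFN()
out = block(torch.randn(2, 16, 512))   # (batch, seq_len, d_model)
```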
arXiv Detail & Related papers (2023-05-10T07:39:57Z)
- Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z)
- AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models on downstream tasks requires updating
hundreds of millions of parameters. This not only increases the serving cost of storing a
large copy of the model weights for every task, but also exhibits instability during
few-shot task adaptation. We introduce a new mechanism that improves adapter capacity
without increasing parameters or computational cost, using two key techniques.
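The two key techniques are not spelled out in this summary; one plausible, hedged sketch of a mixture-of-adapters layer routes stochastically among bottleneck adapters during training and averages their weights at inference so serving cost stays at a single adapter. The routing scheme, sizes, and names below are assumptions, not the paper's implementation.

```python
# Hedged sketch of a mixture-of-adapters layer: several bottleneck adapters,
# one chosen at random per training step, averaged into one adapter for inference.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdapterMixture(nn.Module):
    def __init__(self, d_model=768, bottleneck=16, num_adapters=4):
        super().__init__()
        self.down = nn.ModuleList([nn.Linear(d_model, bottleneck) for _ in range(num_adapters)])
        self.up = nn.ModuleList([nn.Linear(bottleneck, d_model) for _ in range(num_adapters)])

    def forward(self, x):
        if self.training:
            i = random.randrange(len(self.down))                 # stochastic routing
            return x + self.up[i](F.relu(self.down[i](x)))
        # Inference: average adapter weights so only one adapter is applied.
        w_down = torch.stack([m.weight for m in self.down]).mean(0)
        b_down = torch.stack([m.bias for m in self.down]).mean(0)
        w_up = torch.stack([m.weight for m in self.up]).mean(0)
        b_up = torch.stack([m.bias for m in self.up]).mean(0)
        h = F.relu(F.linear(x, w_down, b_down))
        return x + F.linear(h, w_up, b_up)
```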
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although the network performance is boosted, transformers often require more computational resources.
We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
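As a hedged illustration of distilling through image-patch relationships, the sketch below matches patch-to-patch similarity matrices between teacher and student features; it is a generic relational-distillation example under that assumption, not the paper's exact fine-grained manifold loss.

```python
# Relation-based distillation sketch: match the patch-to-patch similarity
# structure of teacher and student features rather than the features themselves.
import torch
import torch.nn.functional as F

def patch_relation(features):
    # features: (batch, num_patches, dim) -> cosine similarity between patches
    f = F.normalize(features, dim=-1)
    return f @ f.transpose(-2, -1)          # (batch, num_patches, num_patches)

def relation_distill_loss(student_feats, teacher_feats):
    # Works even when student and teacher feature dimensions differ.
    return F.mse_loss(patch_relation(student_feats), patch_relation(teacher_feats))

loss = relation_distill_loss(torch.randn(2, 196, 384), torch.randn(2, 196, 768))
```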
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Stochastic Attention Head Removal: A simple and effective method for improving Transformer Based ASR Models [40.991809705930955]
We propose to randomly remove attention heads during training and keep all attention heads at test time, so that the final model is an ensemble of models with different architectures.
Our method gives consistent performance gains over strong baselines on the Wall Street Journal, AISHELL, Switchboard and AMI datasets.
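A minimal sketch of the head-removal idea, assuming attention-head outputs shaped (batch, heads, seq_len, head_dim): whole heads are zeroed at random in training mode and all heads are kept in eval mode. The class name and removal probability are illustrative.

```python
# Stochastic head removal sketch: drop whole attention heads during training,
# keep every head at test time, so inference acts like an ensemble over head subsets.
import torch
import torch.nn as nn

class HeadRemoval(nn.Module):
    def __init__(self, num_heads=8, p_remove=0.1):
        super().__init__()
        self.num_heads, self.p_remove = num_heads, p_remove

    def forward(self, head_outputs):  # (batch, num_heads, seq_len, head_dim)
        if not self.training:
            return head_outputs                                   # keep all heads at test time
        keep = (torch.rand(self.num_heads, device=head_outputs.device) > self.p_remove).float()
        return head_outputs * keep.view(1, -1, 1, 1)              # zero out removed heads

drop = HeadRemoval()
drop.train()
masked = drop(torch.randn(2, 8, 50, 64))
```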
arXiv Detail & Related papers (2020-11-08T15:41:03Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
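A rough sketch of the cascade idea, under the assumption that lightweight rankers attached to intermediate layers prune low-scoring answer candidates before deeper layers run; the layer counts, pruning schedule, and names are hypothetical.

```python
# Cascade-of-rankers sketch: score candidates at intermediate depths and only
# run the surviving candidates through the deeper, more expensive layers.
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
                        for _ in range(6)])
rankers = nn.ModuleList([nn.Linear(256, 1) for _ in range(3)])   # heads after layers 2, 4, 6

def cascade_rank(candidates, keep_ratio=0.5):
    x = candidates                                   # (num_candidates, seq_len, 256)
    for stage, end in enumerate([2, 4, 6]):
        for layer in layers[end - 2:end]:
            x = layer(x)
        scores = rankers[stage](x[:, 0]).squeeze(-1)
        if stage < 2:                                # prune between stages, keep the top fraction
            keep = scores.topk(max(1, int(len(scores) * keep_ratio))).indices
            x = x[keep]
    return scores                                    # scores for the surviving candidates

out = cascade_rank(torch.randn(16, 32, 256))
```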
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
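A minimal sketch of last-layer self-attention distillation in this style: the student's attention distributions are pushed toward the teacher's with KL divergence (the paper's additional value-relation transfer is omitted here); shapes and names are assumptions.

```python
# Self-attention distillation sketch: align the student's last-layer attention
# distributions with the teacher's using KL divergence.
import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn_logits, teacher_attn_logits):
    # *_attn_logits: (batch, heads, seq_len, seq_len) pre-softmax attention scores
    s = F.log_softmax(student_attn_logits, dim=-1)
    t = F.softmax(teacher_attn_logits, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")

loss = attention_distill_loss(torch.randn(2, 12, 64, 64), torch.randn(2, 12, 64, 64))
```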
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.