Build a Robust QA System with Transformer-based Mixture of Experts
- URL: http://arxiv.org/abs/2204.09598v1
- Date: Sun, 20 Mar 2022 02:38:29 GMT
- Title: Build a Robust QA System with Transformer-based Mixture of Experts
- Authors: Yu Qing Zhou, Xixuan Julie Liu, Yuanzhe Dong
- Abstract summary: We build a robust question answering system that can adapt to out-of-domain datasets.
We show that the combination of our best architecture and data augmentation techniques achieves a 53.477 F1 score in the out-of-domain evaluation.
- Score: 0.29005223064604074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we aim to build a robust question answering system that can
adapt to out-of-domain datasets. A single network may overfit to
superficial correlations in the training distribution, but with a meaningful
number of expert sub-networks, a gating network that selects a sparse
combination of experts for each input, and careful balancing of the importance
of expert sub-networks, the Mixture-of-Experts (MoE) model allows us to train a
multi-task learner that generalizes to out-of-domain datasets. We also
explore the possibility of bringing the MoE layers up into the middle of
DistilBERT and replacing its dense feed-forward networks with
sparsely-activated Switch FFN layers, similar to the Switch Transformer
architecture, which simplifies the MoE routing algorithm with reduced
communication and computational costs. In addition to model architectures, we
explore techniques of data augmentation including Easy Data Augmentation (EDA)
and back translation, to create more meaningful variance in the small
out-of-domain training data, thereby boosting the performance and robustness
of our models. We show that the combination of our best architecture
and data augmentation techniques achieves a 53.477 F1 score in the
out-of-domain evaluation, a 9.52% performance gain over the baseline.
On the final test set, we report higher scores of 59.506 F1 and 41.651 EM. We
successfully demonstrate the effectiveness of the Mixture-of-Experts
architecture on a Robust QA task.
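As a concrete illustration of the architecture described in the abstract, the sketch below shows a sparsely-gated MoE feed-forward layer in PyTorch: a small gating network scores a few expert FFNs per token, only the top-k experts are activated (k=1 recovers Switch-Transformer-style routing), and an auxiliary importance loss encourages balanced expert usage. This is a minimal sketch only; the hidden sizes, number of experts, and the exact form of the balancing loss are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch (PyTorch) of a sparsely-gated MoE feed-forward layer.
# Sizes and the balancing loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, num_experts=4, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                              # x: (batch, seq, d_model)
        probs = F.softmax(self.gate(x), dim=-1)        # (batch, seq, num_experts)

        # Keep only the top-k experts per token; k=1 gives Switch-style routing.
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[..., slot]                  # (batch, seq)
            weight = topk_probs[..., slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e                        # tokens routed to expert e
                if mask.any():
                    out[mask] = out[mask] + weight[mask] * expert(x[mask])

        # Importance (load-balancing) loss: squared coefficient of variation of
        # the per-expert gate mass, pushing the router to use experts evenly.
        importance = probs.sum(dim=(0, 1))             # (num_experts,)
        aux_loss = importance.var() / (importance.mean() ** 2 + 1e-9)
        return out, aux_loss
```

In a DistilBERT-style encoder, a layer like this would stand in for the dense FFN inside a transformer block, and `aux_loss` would be added to the QA loss with a small weight to keep expert importance balanced.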
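On the data side, the following sketch outlines the two augmentation ideas named in the abstract. Only the dictionary-free EDA operations (random swap and random deletion) are shown; synonym replacement and insertion would additionally need a thesaurus such as WordNet, and `translate_fn` is a hypothetical callable standing in for whichever machine-translation model performs the round trip for back translation.

```python
# Rough sketch of the augmentation techniques named above. random_swap and
# random_deletion are the dictionary-free EDA operations; back_translate pivots
# through another language via a hypothetical translate_fn callable.
import random


def random_swap(words, n_swaps=1):
    """Swap two random word positions n_swaps times."""
    words = list(words)
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]


def eda_augment(sentence, n_copies=2):
    """Produce n_copies noisy paraphrases of a training sentence."""
    words = sentence.split()
    return [" ".join(random_deletion(random_swap(words))) for _ in range(n_copies)]


def back_translate(sentence, translate_fn, pivot_lang="de"):
    """Round-trip translate to paraphrase while preserving meaning."""
    pivot = translate_fn(sentence, src="en", tgt=pivot_lang)
    return translate_fn(pivot, src=pivot_lang, tgt="en")
```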
Related papers
- Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
Current MoE models often display parameter inefficiency.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE).
arXiv Detail & Related papers (2024-08-13T10:25:13Z)
- Transformer-based Federated Learning for Multi-Label Remote Sensing Image Classification [2.3255040478777755]
We investigate the capability of state-of-the-art transformer architectures to address the challenges related to non-IID training data across various clients.
The considered transformer architectures increase this ability at the cost of higher local training and aggregation complexities.
arXiv Detail & Related papers (2024-05-24T10:13:49Z)
- Mechanistic Design and Scaling of Hybrid Architectures [114.3129802943915]
We identify and test new hybrid architectures constructed from a variety of computational primitives.
We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis.
We find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures.
arXiv Detail & Related papers (2024-03-26T16:33:12Z)
- Efficient Deep Spiking Multi-Layer Perceptrons with Multiplication-Free Inference [13.924924047051782]
Deep convolutional architectures for Spiking Neural Networks (SNNs) have significantly enhanced image classification performance and reduced computational burdens.
This research explores a new pathway, drawing inspiration from the progress made in Multi-Layer Perceptrons (MLPs).
We propose an innovative spiking architecture that uses batch normalization to retain MFI compatibility.
We establish an efficient multi-stage spiking network that effectively blends global receptive fields with local feature extraction.
arXiv Detail & Related papers (2023-06-21T16:52:20Z)
- DA-VEGAN: Differentiably Augmenting VAE-GAN for microstructure reconstruction from extremely small data sets [110.60233593474796]
DA-VEGAN is a model with two central innovations.
A $\beta$-variational autoencoder is incorporated into a hybrid GAN architecture.
A custom differentiable data augmentation scheme is developed specifically for this architecture.
arXiv Detail & Related papers (2023-02-17T08:49:09Z)
- Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed SUPER, to better capture instance-specific vision-semantic characteristics.
We comparatively justify the effectiveness and generalization ability of the proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
- Supernet Training for Federated Image Classification under System Heterogeneity [15.2292571922932]
In this work, we propose a novel framework, namely Federation of Supernet Training (FedSup), to consider both scenarios.
It is inspired by how averaging parameters in the model aggregation stage of Federated Learning (FL) is similar to weight-sharing in supernet training.
Under our framework, we present an efficient algorithm (E-FedSup) that sends sub-models to clients in the broadcast stage to reduce communication costs and training overhead.
arXiv Detail & Related papers (2022-06-03T02:21:01Z)
- Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z)
- Edge-assisted Democratized Learning Towards Federated Analytics [67.44078999945722]
We show the hierarchical learning structure of the proposed edge-assisted democratized learning mechanism, namely Edge-DemLearn.
We also validate Edge-DemLearn as a flexible model training mechanism to build a distributed control and aggregation methodology in regions.
arXiv Detail & Related papers (2020-12-01T11:46:03Z)
- Wide-band butterfly network: stable and efficient inversion via multi-frequency neural networks [1.2891210250935143]
We introduce an end-to-end deep learning architecture called the wide-band butterfly network (WideBNet) for approximating the inverse scattering map from wide-band scattering data.
This architecture incorporates tools from computational harmonic analysis, such as the butterfly factorization, and traditional multi-scale methods, such as the Cooley-Tukey FFT algorithm.
arXiv Detail & Related papers (2020-11-24T21:48:43Z)
- Fitting the Search Space of Weight-sharing NAS with Graph Convolutional Networks [100.14670789581811]
We train a graph convolutional network to fit the performance of sampled sub-networks.
With this strategy, we achieve a higher rank correlation coefficient in the selected set of candidates.
arXiv Detail & Related papers (2020-04-17T19:12:39Z)