Ensemble ALBERT on SQuAD 2.0
- URL: http://arxiv.org/abs/2110.09665v1
- Date: Tue, 19 Oct 2021 00:15:19 GMT
- Title: Ensemble ALBERT on SQuAD 2.0
- Authors: Shilun Li, Renee Li, Veronica Peng
- Abstract summary: In our paper, we use fine-tuned ALBERT models and implement combinations of additional layers to improve model performance.
Our best-performing individual model is ALBERT-xxlarge + ALBERT-SQuAD-out, which achieved an F1 score of 88.435 on the dev set.
By passing the results of several best-performing models into our weighted voting ensemble algorithm, our final result ranks first on the Stanford CS224N Test PCE SQuAD Leaderboard with F1 = 90.123.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine question answering is an essential yet challenging task in natural
language processing. Recently, Pre-trained Contextual Embeddings (PCE) models
like Bidirectional Encoder Representations from Transformers (BERT) and A Lite
BERT (ALBERT) have attracted lots of attention due to their great performance
in a wide range of NLP tasks. In our paper, we used fine-tuned ALBERT
models and implemented combinations of additional layers (e.g. attention layer,
RNN layer) on top of them to improve model performance on the Stanford Question
Answering Dataset (SQuAD 2.0). We implemented four different models with
different layers on top of the ALBERT-base model, and two other models based on
ALBERT-xlarge and ALBERT-xxlarge. We compared their performance in detail
against our baseline model, ALBERT-base-v2 + ALBERT-SQuAD-out. Our best-performing
individual model is ALBERT-xxlarge + ALBERT-SQuAD-out, which achieved an F1
score of 88.435 on the dev set. Furthermore, we implemented three
different ensemble algorithms to boost overall performance. By passing
the results of several best-performing models into our weighted voting ensemble
algorithm, our final result ranks first on the Stanford CS224N Test PCE SQuAD
Leaderboard with F1 = 90.123.
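
The abstract does not spell out how the weighted voting ensemble works, so the following is only a minimal sketch of one common form of weighted voting, not the authors' implementation. It assumes each model emits one final answer string per question, with an empty string standing for "no answer" (as in SQuAD 2.0), and that each model has a scalar weight. The function name `weighted_vote`, the question ids, and the weight values are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-model answers by summing model weights per answer string.

    predictions: list of dicts {question_id: answer_text}, one dict per model.
                 An empty string means the model predicts "no answer".
    weights:     list of floats, one per model (hypothetical values, e.g.
                 derived from each model's dev-set score).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")

    ensemble = {}
    for qid in predictions[0]:
        totals = defaultdict(float)
        for model_preds, weight in zip(predictions, weights):
            totals[model_preds[qid]] += weight
        # Pick the answer with the largest total weight; on a tie, the
        # answer proposed by the earliest model in the list wins.
        ensemble[qid] = max(totals, key=totals.get)
    return ensemble


# Hypothetical outputs of three models on three questions.
model_a = {"q1": "Paris", "q2": "", "q3": "in 1912"}
model_b = {"q1": "Paris", "q2": "1066", "q3": ""}
model_c = {"q1": "", "q2": "1066", "q3": ""}

result = weighted_vote([model_a, model_b, model_c], [0.88, 0.85, 0.80])
print(result)
```

In this sketch a strong model can still be outvoted when two weaker models agree, and "no answer" competes as an ordinary candidate, which matters on SQuAD 2.0 where some questions are unanswerable.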
 
      
Related papers
- Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction [95.91743732150233]
 Goedel-Prover-V2, a series of open-source language models, set a new state-of-the-art in automated theorem proving.
We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems.
Goedel-Prover-V2-32B achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode.
 arXiv  Detail & Related papers  (2025-08-05T16:28:22Z)
- oBERTa: Improving Sparse Transfer Learning via improved initialization,
  distillation, and pruning regimes [82.99830498937729]
 oBERTa is an easy-to-use set of language models for Natural Language Processing.
It allows NLP practitioners to obtain between 3.8 and 24.3 times faster models without expertise in model compression.
We explore the use of oBERTa on seven representative NLP tasks.
 arXiv  Detail & Related papers  (2023-03-30T01:37:19Z)
- FlexiBERT: Are Current Transformer Architectures too Homogeneous and
  Rigid? [7.813154720635396]
 We propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations.
We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization.
A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models.
 arXiv  Detail & Related papers  (2022-05-23T22:44:34Z)
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
  Gradient-Disentangled Embedding Sharing [117.41016786835452]
 This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
 Vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
 arXiv  Detail & Related papers  (2021-11-18T06:48:00Z)
- BERMo: What can BERT learn from ELMo? [6.417011237981518]
 We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths.
Our approach has two-fold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
 arXiv  Detail & Related papers  (2021-10-18T17:35:41Z)
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
 We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
 arXiv  Detail & Related papers  (2021-07-15T16:46:01Z)
- LV-BERT: Exploiting Layer Variety for BERT [85.27287501885807]
 We introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models.
We then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture.
 The LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks.
 arXiv  Detail & Related papers  (2021-06-22T13:20:14Z)
- ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
 BERT relies heavily on the global self-attention block and thus suffers from a large memory footprint and computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads to directly model local dependencies.
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
 arXiv  Detail & Related papers  (2020-08-06T07:43:19Z)
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
 We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
 arXiv  Detail & Related papers  (2020-06-05T19:54:34Z)
- Gestalt: a Stacking Ensemble for SQuAD2.0 [0.0]
 We propose a deep-learning system that finds, or indicates the lack of, a correct answer to a question in a context paragraph.
Our goal is to learn an ensemble of heterogeneous SQuAD2.0 models that outperforms the best model in the ensemble per se.
 arXiv  Detail & Related papers  (2020-04-02T08:09:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     