What do we expect from Multiple-choice QA Systems?
- URL: http://arxiv.org/abs/2011.10647v1
- Date: Fri, 20 Nov 2020 21:27:10 GMT
- Title: What do we expect from Multiple-choice QA Systems?
- Authors: Krunal Shah, Nitish Gupta, Dan Roth
- Abstract summary: We consider a top-performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
- Score: 70.86513724662302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent success of machine learning systems on various QA datasets could
be interpreted as a significant improvement in models' language understanding
abilities. However, using various perturbations, multiple recent works have
shown that good performance on a dataset might not indicate performance that
correlates well with humans' expectations from models that "understand"
language. In this work, we consider a top-performing model on several Multiple
Choice Question Answering (MCQA) datasets, and evaluate it against a set of
expectations one might have from such a model, using a series of
zero-information perturbations of the model's inputs. Our results show that the
model clearly falls short of our expectations, which motivates a modified
training approach that forces the model to better attend to the inputs. We show
that the new training paradigm leads to a model that performs on par with the
original model while better satisfying our expectations.
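The kind of zero-information perturbation described above can be made concrete with a small, hypothetical sketch (the model name and helper function below are illustrative, not the paper's actual setup): blank out the question and check whether the model's top choice survives. A model that genuinely attends to its inputs should not keep making the same confident prediction.

```python
# Minimal sketch of a "zero-information" perturbation check for an MCQA model.
# Assumes a HuggingFace multiple-choice model; names are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMultipleChoice.from_pretrained("roberta-base").eval()

def predict(question: str, choices: list[str]) -> int:
    """Return the index of the highest-scoring answer choice."""
    enc = tokenizer([question] * len(choices), choices,
                    return_tensors="pt", padding=True, truncation=True)
    # The multiple-choice head expects tensors of shape (batch, num_choices, seq_len).
    enc = {k: v.unsqueeze(0) for k, v in enc.items()}
    with torch.no_grad():
        logits = model(**enc).logits  # shape (1, num_choices)
    return int(logits.argmax(dim=-1))

question = "Which gas do plants absorb during photosynthesis?"
choices = ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"]

original = predict(question, choices)
perturbed = predict("", choices)  # zero-information question
print("prediction unchanged under empty question:", original == perturbed)
```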
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the effect of a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up under such challenging evaluation settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- Learn What Is Possible, Then Choose What Is Best: Disentangling One-To-Many Relations in Language Through Text-based Games [3.615981646205045]
We present an approach to train language models that can emulate the desirable behaviours, but not the undesirable ones.
Using text-based games as a testbed, our approach, PASA, uses discrete latent variables to capture the range of different behaviours.
Results show up to 49% empirical improvement over the previous state-of-the-art model.
arXiv Detail & Related papers (2023-04-14T17:11:26Z)
- PAMI: partition input and aggregate outputs for model interpretation [69.42924964776766]
In this study, a simple yet effective visualization framework called PAMI is proposed based on the observation that deep learning models often aggregate features from local regions for model predictions.
The basic idea is to mask the majority of the input and use the corresponding model output as the relative contribution of the preserved input part to the original model prediction.
Extensive experiments on multiple tasks confirm that the proposed method performs better than existing visualization approaches in more precisely finding class-specific input regions.
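As a rough, hypothetical sketch of that mask-and-aggregate idea (not the authors' implementation), one can preserve one image patch at a time, zero out the rest, and read the model's score for the target class as that patch's contribution:

```python
# Occlusion-style sketch of the idea described above; helper name and grid size
# are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F

def pami_style_heatmap(model, image, target_class, grid=4):
    """image: tensor of shape (1, C, H, W); returns a (grid, grid) contribution map."""
    _, _, h, w = image.shape
    ph, pw = h // grid, w // grid
    heat = torch.zeros(grid, grid)
    for i in range(grid):
        for j in range(grid):
            masked = torch.zeros_like(image)               # mask (zero out) everything...
            masked[:, :, i*ph:(i+1)*ph, j*pw:(j+1)*pw] = \
                image[:, :, i*ph:(i+1)*ph, j*pw:(j+1)*pw]  # ...except one patch
            with torch.no_grad():
                prob = F.softmax(model(masked), dim=-1)[0, target_class]
            heat[i, j] = prob                              # contribution of this patch
    return heat

# Toy usage with an untrained classifier, just to show the shapes involved.
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
print(pami_style_heatmap(toy_model, torch.rand(1, 3, 32, 32), target_class=3))
```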
arXiv Detail & Related papers (2023-02-07T08:48:34Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
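The simplest instance of merging models in parameter space is plain parameter averaging; the paper proposes a more refined merging rule, so the sketch below is only meant to make the general idea concrete (all names are illustrative):

```python
# Plain parameter averaging as one instance of "merging in parameter space".
# This is a sketch of the general idea, not the paper's specific fusion method.
import torch

def average_state_dicts(state_dicts):
    """Average a list of state_dicts from models with identical architectures."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Toy usage: merge two copies of the same small network into a single model.
def make_model():
    return torch.nn.Linear(8, 2)

model_a, model_b = make_model(), make_model()
fused = make_model()
fused.load_state_dict(average_state_dicts([model_a.state_dict(), model_b.state_dict()]))
```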
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
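A purely schematic sketch of such a closed loop, with stand-in generator and scorer callables (the actual framework composes large pre-trained models, and its optimization details differ), might look like:

```python
# Schematic generator/scorer loop: the generator proposes candidates, an ensemble
# of scorers votes, and the consensus winner seeds the next round.
# All functions here are toy stand-ins, not the paper's API.
from typing import Callable, List

def iterative_consensus(generate: Callable[[str], List[str]],
                        scorers: List[Callable[[str], float]],
                        prompt: str,
                        rounds: int = 3) -> str:
    best = prompt
    for _ in range(rounds):
        candidates = generate(best)
        # Consensus score: average the scores assigned by all scorers.
        best = max(candidates, key=lambda c: sum(s(c) for s in scorers) / len(scorers))
    return best

# Toy usage with stand-in generator and scorers.
result = iterative_consensus(
    generate=lambda seed: [seed + " a", seed + " b"],
    scorers=[len, lambda text: text.count("a")],
    prompt="start",
)
print(result)
```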
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- A Comparative Study of Transformer-Based Language Models on Extractive Question Answering [0.5079811885340514]
We train various pre-trained language models and fine-tune them on multiple question answering datasets.
Using the F1-score as our metric, we find that the RoBERTa and BART pre-trained models perform the best across all datasets.
arXiv Detail & Related papers (2021-10-07T02:23:19Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- A Simple and Interpretable Predictive Model for Healthcare [0.0]
Deep learning models are currently dominating most state-of-the-art solutions for disease prediction.
These deep learning models, with trainable parameters running into millions, require huge amounts of compute and data to train and deploy.
We develop a simpler yet interpretable non-deep-learning model for application to EHR data.
arXiv Detail & Related papers (2020-07-27T08:13:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.