An Analysis of Dataset Overlap on Winograd-Style Tasks
- URL: http://arxiv.org/abs/2011.04767v1
- Date: Mon, 9 Nov 2020 21:11:17 GMT
- Title: An Analysis of Dataset Overlap on Winograd-Style Tasks
- Authors: Ali Emami, Adam Trischler, Kaheer Suleman and Jackie Chi Kit Cheung
- Abstract summary: We analyze the effects of varying degrees of overlap between training corpora and test instances in WSC-style tasks.
KnowRef-60K is the largest corpus to date for WSC-style common-sense reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Winograd Schema Challenge (WSC) and variants inspired by it have become
important benchmarks for common-sense reasoning (CSR). Model performance on the
WSC has quickly progressed from chance-level to near-human using neural
language models trained on massive corpora. In this paper, we analyze the
effects of varying degrees of overlap between these training corpora and the
test instances in WSC-style tasks. We find that a large number of test
instances overlap considerably with the corpora on which state-of-the-art
models are (pre)trained, and that a significant drop in classification accuracy
occurs when we evaluate models on instances with minimal overlap. Based on
these results, we develop the KnowRef-60K dataset, which consists of over 60k
pronoun disambiguation problems scraped from web data. KnowRef-60K is the
largest corpus to date for WSC-style common-sense reasoning and exhibits a
significantly lower proportion of overlaps with current pretraining corpora.
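The kind of train/test overlap the paper measures can be illustrated with a simple token n-gram comparison between a test instance and a pretraining corpus. The sketch below is a minimal illustration of this idea only, not the authors' exact procedure; the function names and the choice of n = 4 are assumptions.

```python
def ngrams(tokens, n):
    """Return the set of token n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(instance, corpus, n=4):
    """Fraction of the test instance's n-grams that also occur in the corpus.

    A value near 1.0 suggests the instance overlaps heavily with the
    (pre)training data; a value near 0.0 suggests minimal overlap.
    """
    inst = ngrams(instance.lower().split(), n)
    corp = ngrams(corpus.lower().split(), n)
    if not inst:
        return 0.0
    return len(inst & corp) / len(inst)
```

Ranking test instances by such a score is one straightforward way to separate high-overlap from minimal-overlap evaluation subsets.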
Related papers
- Trapped in texture bias? A large scale comparison of deep instance segmentation
We evaluate 68 models on 61 versions of MS COCO for a total of 4148 evaluations.
We find that YOLACT++, SOTR and SOLOv2 are significantly more robust to out-of-distribution texture than other frameworks.
arXiv Detail & Related papers (2024-01-17T10:21:08Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- Model Rectification via Unknown Unknowns Extraction from Deployment Samples
We propose a general algorithmic framework that aims to perform a post-training model rectification at deployment time in a supervised way.
RTSCV extracts unknown unknowns (u.u.s) from deployment samples.
We show that RTSCV consistently outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2021-02-08T11:46:19Z)
- RethinkCWS: Is Chinese Word Segmentation a Solved Task?
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks.
In this paper, we take stock of what we have achieved and rethink what's left in the CWS task.
arXiv Detail & Related papers (2020-11-13T11:07:08Z)
- The Devil is in Classification: A Simple Framework for Long-tail Object Detection and Instance Segmentation
We investigate the performance drop of the state-of-the-art two-stage instance segmentation model Mask R-CNN on the recent long-tail LVIS dataset.
We unveil that a major cause is the inaccurate classification of object proposals.
We propose a simple calibration framework to more effectively alleviate classification head bias with a bi-level class balanced sampling approach.
arXiv Detail & Related papers (2020-07-23T12:49:07Z)
- Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax
We provide the first systematic analysis of the underperformance of state-of-the-art models on long-tailed distributions.
We propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training.
Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors.
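The group-wise idea behind such balanced training can be sketched as computing a softmax separately within each frequency group of classes, so that rare (tail) classes never compete directly with frequent (head) classes in one normalization. This is an illustrative simplification only, not the authors' full BAGS module; the function name, grouping, and example logits are assumptions.

```python
import math

def group_softmax(logits, class_groups):
    """Compute softmax separately within each group of class indices.

    Classes in different groups (e.g. head vs. tail classes, grouped by
    training-sample frequency) are normalized independently, so a tail
    class's probability is not suppressed by large head-class logits.
    """
    probs = [0.0] * len(logits)
    for group in class_groups:
        exps = [math.exp(logits[c]) for c in group]
        total = sum(exps)
        for c, e in zip(group, exps):
            probs[c] = e / total
    return probs

# Tail class 3 keeps a usable probability within its own group even though
# head-class logits (indices 0 and 1) are much larger.
p = group_softmax([5.0, 5.0, 1.0, 1.0], class_groups=[[0, 1], [2, 3]])
```

With a single softmax over all four classes, indices 2 and 3 would receive near-zero probability; group-wise normalization instead gives each class a fair contest within its own group.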
arXiv Detail & Related papers (2020-06-18T10:24:26Z)
- Adaptive Name Entity Recognition under Highly Unbalanced Data
We present experiments on a neural architecture composed of a Conditional Random Field (CRF) layer stacked on top of a Bi-directional LSTM (Bi-LSTM) layer for solving NER tasks.
We introduce an add-on classification model that splits sentences into two sets, Weak and Strong classes, and then design a pair of Bi-LSTM-CRF models tuned to optimize performance on each set.
arXiv Detail & Related papers (2020-03-10T06:56:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.