An Analysis of Dataset Overlap on Winograd-Style Tasks
- URL: http://arxiv.org/abs/2011.04767v1
- Date: Mon, 9 Nov 2020 21:11:17 GMT
- Title: An Analysis of Dataset Overlap on Winograd-Style Tasks
- Authors: Ali Emami, Adam Trischler, Kaheer Suleman and Jackie Chi Kit Cheung
- Abstract summary: We analyze the effects of varying degrees of overlap between training corpora and test instances in WSC-style tasks.
KnowRef-60K is the largest corpus to date for WSC-style common-sense reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Winograd Schema Challenge (WSC) and variants inspired by it have become
important benchmarks for common-sense reasoning (CSR). Model performance on the
WSC has quickly progressed from chance-level to near-human using neural
language models trained on massive corpora. In this paper, we analyze the
effects of varying degrees of overlap between these training corpora and the
test instances in WSC-style tasks. We find that a large number of test
instances overlap considerably with the corpora on which state-of-the-art
models are (pre)trained, and that a significant drop in classification accuracy
occurs when we evaluate models on instances with minimal overlap. Based on
these results, we develop the KnowRef-60K dataset, which consists of over 60k
pronoun disambiguation problems scraped from web data. KnowRef-60K is the
largest corpus to date for WSC-style common-sense reasoning and exhibits a
significantly lower proportion of overlaps with current pretraining corpora.
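The kind of train/test overlap the paper measures can be illustrated with a simple token n-gram comparison between a test instance and a pretraining corpus. The sketch below is a minimal illustration of this idea only, not the authors' exact procedure; the function names and the choice of n = 4 are assumptions.

```python
def ngrams(tokens, n):
    """Return the set of token n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(instance, corpus, n=4):
    """Fraction of the test instance's n-grams that also occur in the corpus.

    A value near 1.0 suggests the instance overlaps heavily with the
    (pre)training data; a value near 0.0 suggests minimal overlap.
    """
    inst = ngrams(instance.lower().split(), n)
    corp = ngrams(corpus.lower().split(), n)
    if not inst:
        return 0.0
    return len(inst & corp) / len(inst)
```

Ranking test instances by such a score is one straightforward way to separate high-overlap from minimal-overlap evaluation subsets.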
Related papers
- Trapped in texture bias? A large scale comparison of deep instance segmentation
We evaluate 68 models on 61 versions of MS COCO for a total of 4148 evaluations.
We find that YOLACT++, SOTR and SOLOv2 are significantly more robust to out-of-distribution texture than other frameworks.
arXiv Detail & Related papers (2024-01-17T10:21:08Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- Model Rectification via Unknown Unknowns Extraction from Deployment Samples
We propose a general algorithmic framework that aims to perform a post-training model rectification at deployment time in a supervised way.
RTSCV extracts unknown unknowns (u.u.s) from deployment samples.
We show that RTSCV consistently outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2021-02-08T11:46:19Z)
- RethinkCWS: Is Chinese Word Segmentation a Solved Task?
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks.
In this paper, we take stock of what we have achieved and rethink what's left in the CWS task.
arXiv Detail & Related papers (2020-11-13T11:07:08Z)
- The Devil is in Classification: A Simple Framework for Long-tail Object Detection and Instance Segmentation
We investigate the performance drop of the state-of-the-art two-stage instance segmentation model Mask R-CNN on the recent long-tail LVIS dataset.
We unveil that a major cause is the inaccurate classification of object proposals.
We propose a simple calibration framework to more effectively alleviate classification head bias with a bi-level class balanced sampling approach.
arXiv Detail & Related papers (2020-07-23T12:49:07Z)
- Overcoming Classifier Imbalance for Long-tail Object Detection with Balanced Group Softmax
We provide the first systematic analysis of the underperformance of state-of-the-art models on long-tailed distributions.
We propose a novel balanced group softmax (BAGS) module for balancing the classifiers within the detection frameworks through group-wise training.
Extensive experiments on the very recent long-tail large vocabulary object recognition benchmark LVIS show that our proposed BAGS significantly improves the performance of detectors.
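The group-wise idea behind such balanced training can be sketched as computing a softmax separately within each frequency group of classes, so that rare (tail) classes never compete directly with frequent (head) classes in one normalization. This is an illustrative simplification only, not the authors' full BAGS module; the function name, grouping, and example logits are assumptions.

```python
import math

def group_softmax(logits, class_groups):
    """Compute softmax separately within each group of class indices.

    Classes in different groups (e.g. head vs. tail classes, grouped by
    training-sample frequency) are normalized independently, so a tail
    class's probability is not suppressed by large head-class logits.
    """
    probs = [0.0] * len(logits)
    for group in class_groups:
        exps = [math.exp(logits[c]) for c in group]
        total = sum(exps)
        for c, e in zip(group, exps):
            probs[c] = e / total
    return probs

# Tail class 3 keeps a usable probability within its own group even though
# head-class logits (indices 0 and 1) are much larger.
p = group_softmax([5.0, 5.0, 1.0, 1.0], class_groups=[[0, 1], [2, 3]])
```

With a single softmax over all four classes, indices 2 and 3 would receive near-zero probability; group-wise normalization instead gives each class a fair contest within its own group.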
arXiv Detail & Related papers (2020-06-18T10:24:26Z)
- Adaptive Name Entity Recognition under Highly Unbalanced Data
We present experiments on a neural architecture composed of a Conditional Random Field (CRF) layer stacked on top of a Bi-directional LSTM (Bi-LSTM) layer for solving NER tasks.
We introduce an add-on classification model that splits sentences into two sets, Weak and Strong classes, and then design a pair of Bi-LSTM-CRF models tuned to optimize performance on each set.
arXiv Detail & Related papers (2020-03-10T06:56:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.