Back to Square One: Bias Detection, Training and Commonsense Disentanglement in the Winograd Schema
- URL: http://arxiv.org/abs/2104.08161v1
- Date: Fri, 16 Apr 2021 15:17:23 GMT
- Title: Back to Square One: Bias Detection, Training and Commonsense Disentanglement in the Winograd Schema
- Authors: Yanai Elazar, Hongming Zhang, Yoav Goldberg, Dan Roth
- Abstract summary: The Winograd Schema (WS) has been proposed as a test for measuring commonsense capabilities of models.
We show that the current evaluation method of WS is sub-optimal and propose a modification that makes use of twin sentences for evaluation.
We conclude that much of the apparent progress on WS may not necessarily reflect progress in commonsense reasoning.
- Score: 106.79804048131253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Winograd Schema (WS) has been proposed as a test for measuring
commonsense capabilities of models. Recently, pre-trained language model-based
approaches have boosted performance on some WS benchmarks, but the source of
improvement is still not clear. We begin by showing that the current evaluation
method of WS is sub-optimal and propose a modification that makes use of twin
sentences for evaluation. We also propose two new baselines that indicate the
existence of biases in WS benchmarks. Finally, we propose a method for
evaluating WS-like sentences in a zero-shot setting and observe that popular
language models perform randomly in this setting. We conclude that much of the
apparent progress on WS may not necessarily reflect progress in commonsense
reasoning, but much of it comes from supervised data, which is not likely to
account for all the required commonsense reasoning skills and knowledge.
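A Winograd twin pair consists of two sentences that differ only in a trigger word, which flips the correct referent, so scoring the pair as a unit removes the credit a biased model earns from per-sentence lexical cues. The sketch below illustrates both the paired evaluation and the zero-shot language-model scoring mentioned in the abstract; the `TwinPair` layout, the "_" pronoun placeholder, and the `log_prob` callable are illustrative assumptions, not the authors' released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TwinPair:
    """One Winograd twin pair: two sentences that differ in a single
    trigger word, each sharing the same two candidate referents."""
    sentences: List[str]   # exactly two twin sentences; '_' marks the pronoun
    candidates: List[str]  # the two candidate referents
    gold: List[int]        # index of the correct candidate per sentence

def zero_shot_predict(sentence: str, candidates: List[str],
                      log_prob: Callable[[str], float]) -> int:
    """Zero-shot scoring: substitute each candidate for the placeholder and
    pick the one whose full sentence the LM assigns higher log-probability."""
    scores = [log_prob(sentence.replace("_", cand)) for cand in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

def paired_accuracy(pairs: List[TwinPair],
                    log_prob: Callable[[str], float]) -> Tuple[float, float]:
    """Return (per-sentence accuracy, paired accuracy). A pair counts as
    correct only if BOTH twins are resolved correctly."""
    sent_hits, pair_hits = 0, 0
    for pair in pairs:
        hits = [zero_shot_predict(s, pair.candidates, log_prob) == g
                for s, g in zip(pair.sentences, pair.gold)]
        sent_hits += sum(hits)
        pair_hits += all(hits)
    n = len(pairs)
    return sent_hits / (2 * n), pair_hits / n
```

With two candidates per sentence, random guessing still yields about 50% per-sentence accuracy but only about 25% paired accuracy, which is what makes the paired metric the more diagnostic of the two.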
Related papers
- Advancing Cross-Domain Generalizability in Face Anti-Spoofing: Insights, Design, and Metrics [10.631157315662607]
This paper presents a novel perspective for enhancing anti-spoofing performance in zero-shot data domain generalization.
Going one step beyond previous frame-wise spoofing prediction, we introduce a nuanced metric calculation that aggregates frame-level probabilities into a video-wise prediction.
Our final model outperforms existing state-of-the-art methods across the datasets.
arXiv Detail & Related papers (2024-06-18T04:15:22Z)
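The anti-spoofing summary above states only that frame-level probabilities are aggregated into a video-wise prediction. A minimal sketch of one such aggregation follows; the simple mean and the top-k variant are assumptions for illustration, not the paper's exact rule.

```python
from typing import Optional
import numpy as np

def video_spoof_score(frame_probs: np.ndarray,
                      top_k: Optional[int] = None) -> float:
    """Aggregate per-frame spoof probabilities into one video-level score.

    frame_probs: array of shape (num_frames,), spoof probabilities in [0, 1].
    top_k: if set, average only the k most confident frames (an assumed
           variant); otherwise average all frames.
    """
    if top_k is not None:
        frame_probs = np.sort(frame_probs)[-top_k:]  # keep the k largest
    return float(frame_probs.mean())

# A video is flagged as a spoof when the aggregated score crosses a threshold.
is_spoof = video_spoof_score(np.array([0.2, 0.9, 0.8, 0.7])) > 0.5
```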
- RDumb: A simple approach that questions our progress in continual test-time adaptation [12.374649969346441]
Test-Time Adaptation (TTA) allows pre-trained models to be updated to changing data distributions at deployment time.
Recent work proposed and applied methods for continual adaptation over long timescales.
We find that all but one of the state-of-the-art methods eventually collapse and perform worse than a non-adapting model.
arXiv Detail & Related papers (2023-06-08T17:52:34Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further elaborate the robustness metric: a model is judged robust only if its performance is consistently accurate across each entire clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering [46.403929561360485]
We study end-to-end model performance on open-domain question answering (ODQA).
We find that not only do models fail to generalize, but high retrieval scores often still yield poor answer prediction accuracy.
We propose and evaluate several intervention methods which improve end-to-end answer F1 score by up to 24 points.
arXiv Detail & Related papers (2022-12-20T16:06:09Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", which discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
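The dR@n,IoU@m metric above is described only as discounting basic recall scores. One plausible reading, sketched here as an assumption rather than the paper's verbatim definition, multiplies the usual top-n hit indicator by discount factors that shrink with the distance between predicted and ground-truth boundaries, normalized by video duration.

```python
from typing import List, Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU between two temporal segments (start, end), in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def discounted_recall_at_n(preds: List[Tuple[float, float]],
                           gt: Tuple[float, float],
                           duration: float, n: int, m: float) -> float:
    """Assumed form of dR@n,IoU@m for one query: among the top-n predicted
    moments, a hit (IoU >= m) is discounted by (1 - normalized start error)
    * (1 - normalized end error), so moments that merely overlap a biased
    annotation region score less than precisely localized ones."""
    best = 0.0
    for s, e in preds[:n]:
        if temporal_iou((s, e), gt) >= m:
            alpha_s = 1.0 - abs(s - gt[0]) / duration
            alpha_e = 1.0 - abs(e - gt[1]) / duration
            best = max(best, alpha_s * alpha_e)
    return best  # averaged over all queries to obtain the benchmark number
```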
- WRENCH: A Comprehensive Benchmark for Weak Supervision [66.82046201714766]
The benchmark consists of 22 varied real-world datasets for classification and sequence tagging.
We use the benchmark to conduct extensive comparisons over more than 100 method variants, demonstrating its efficacy as a benchmark platform.
arXiv Detail & Related papers (2021-09-23T13:47:16Z)
- RethinkCWS: Is Chinese Word Segmentation a Solved Task? [81.11161697133095]
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks.
In this paper, we take stock of what we have achieved and rethink what's left in the CWS task.
arXiv Detail & Related papers (2020-11-13T11:07:08Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.