Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold
- URL: http://arxiv.org/abs/2206.09755v1
- Date: Mon, 20 Jun 2022 13:04:23 GMT
- Title: Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold
- Authors: Sebastian Ruder, Ivan Vulić, Anders Søgaard
- Abstract summary: We show through a manual classification of recent NLP research papers that this is indeed the case.
We observe that NLP research often goes beyond the square one setup, focusing not only on accuracy, but also on fairness or interpretability, but typically only along a single dimension.
- Score: 88.83876819883653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prototypical NLP experiment trains a standard architecture on labeled
English data and optimizes for accuracy, without accounting for other
dimensions such as fairness, interpretability, or computational efficiency. We
show through a manual classification of recent NLP research papers that this is
indeed the case and refer to it as the square one experimental setup. We
observe that NLP research often goes beyond the square one setup, e.g., focusing
not only on accuracy, but also on fairness or interpretability, but typically
only along a single dimension. Most work targeting multilinguality, for
example, considers only accuracy; most work on fairness or interpretability
considers only English; and so on. We show this through manual classification
of recent NLP research papers and ACL Test-of-Time award recipients. Such
one-dimensionality of most research means we are only exploring a fraction of
the NLP research search space. We provide historical and recent examples of how
the square one bias has led researchers to draw false conclusions or make
unwise choices, point to promising yet unexplored directions on the research
manifold, and make practical recommendations to enable more multi-dimensional
research. We open-source the results of our annotations to enable further
analysis at https://github.com/google-research/url-nlp
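The annotation analysis described in the abstract can be sketched as a simple tally over dimension tags. The papers and tags below are made-up placeholders, not the authors' actual annotations (those are in the linked repository):

```python
from collections import Counter

# Hypothetical annotations: each paper is tagged with the research
# dimensions it explores beyond the square one setup (accuracy on
# labelled English data). Real annotations live in the paper's repo.
papers = {
    "paper_a": {"multilinguality"},
    "paper_b": {"fairness"},
    "paper_c": {"interpretability"},
    "paper_d": {"multilinguality", "fairness"},  # rare multi-dimensional work
    "paper_e": set(),                            # square one setup
}

# How many dimensions does each paper explore?
dims_explored = Counter(len(dims) for dims in papers.values())
print(dims_explored)

# Fraction of papers exploring more than one dimension at once.
multi_frac = sum(len(d) > 1 for d in papers.values()) / len(papers)
print(f"multi-dimensional papers: {multi_frac:.0%}")
```

With this toy data, most papers cluster at exactly one dimension, mirroring the one-dimensionality the paper reports.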
Related papers
- Fairpriori: Improving Biased Subgroup Discovery for Deep Neural Network Fairness [21.439820064223877]
This paper introduces Fairpriori, a novel biased subgroup discovery method.
It incorporates the frequent itemset generation algorithm to facilitate effective and efficient investigation of intersectional bias.
Fairpriori demonstrates superior effectiveness and efficiency when identifying intersectional bias.
arXiv Detail & Related papers (2024-06-25T00:15:13Z)
- Are fairness metric scores enough to assess discrimination biases in machine learning? [4.073786857780967]
We focus on the Bios dataset, and our learning task is to predict the occupation of individuals, based on their biography.
We address an important limitation of theoretical discussions dealing with group-wise fairness metrics: they focus on large datasets.
We then question how reliable different popular measures of bias are when the size of the training set is simply sufficient to learn reasonably accurate predictions.
arXiv Detail & Related papers (2023-06-08T15:56:57Z)
- How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench [52.11481619456093]
We study the performance prediction problem on experiment records from BIG-bench.
An $R^2$ score greater than 95% indicates the presence of learnable patterns within the experiment records.
We find a subset as informative as BIG-bench Hard for evaluating new model families, while being $3\times$ smaller.
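The $R^2$ claim above can be illustrated with a small sketch. The "actual" and "predicted" accuracies below are invented numbers, not BIG-bench records; the point is only how the coefficient of determination is computed:

```python
import numpy as np

# Hypothetical experiment records: actual normalized accuracies for a
# handful of model/task pairs, and the values a regressor predicted.
actual = np.array([0.62, 0.71, 0.55, 0.80, 0.67])
predicted = np.array([0.60, 0.73, 0.54, 0.78, 0.69])

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
ss_res = float(np.sum((actual - predicted) ** 2))
ss_tot = float(np.sum((actual - actual.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 = {r2:.3f}")
```

An $R^2$ above 0.95, as in this toy case, means the predictor explains nearly all of the variance across the records.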
arXiv Detail & Related papers (2023-05-24T09:35:34Z)
- This Prompt is Measuring <MASK>: Evaluating Bias Evaluation in Language Models [12.214260053244871]
We analyse the body of work that uses prompts and templates to assess bias in language models.
We draw on a measurement modelling framework to create a taxonomy of attributes that capture what a bias test aims to measure.
Our analysis illuminates the scope of possible bias types the field is able to measure, and reveals types that are as yet under-researched.
arXiv Detail & Related papers (2023-05-22T06:28:48Z)
- Fair Enough: Standardizing Evaluation and Model Selection for Fairness Research in NLP [64.45845091719002]
Modern NLP systems exhibit a range of biases, which a growing literature on model debiasing attempts to correct.
This paper seeks to clarify the current situation and plot a course for meaningful progress in fair learning.
arXiv Detail & Related papers (2023-02-11T14:54:00Z)
- Beyond Distributional Hypothesis: Let Language Models Learn Meaning-Text Correspondence [45.9949173746044]
We show that large-size pre-trained language models (PLMs) do not satisfy the logical negation property (LNP).
We propose a novel intermediate training task, named meaning-matching, designed to directly learn a meaning-text correspondence.
We find that the task enables PLMs to learn lexical semantic information.
arXiv Detail & Related papers (2022-05-08T08:37:36Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- With Little Power Comes Great Responsibility [54.96675741328462]
Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements.
Small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered.
For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point.
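A power estimate like the one above can be sketched with a simple Monte Carlo simulation. The per-sentence effect size and standard deviation below are illustrative assumptions chosen to land near 75% power; they are not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def power_estimate(n_sentences, true_diff, sd, trials=2000, crit=1.96):
    """Estimate the power of a paired test to detect a mean score difference.

    Draws per-sentence score differences between two systems and counts
    how often a two-sided z-style test on the mean rejects the null.
    """
    detected = 0
    for _ in range(trials):
        diffs = rng.normal(true_diff, sd, size=n_sentences)
        stat = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n_sentences))
        if abs(stat) > crit:
            detected += 1
    return detected / trials

# Illustrative settings: a 2000-sentence test set and a small true
# per-sentence difference relative to its standard deviation.
power = power_estimate(n_sentences=2000, true_diff=0.01, sd=0.17)
print(f"estimated power: {power:.2f}")
```

A power around 0.75 means that roughly one in four real improvements of this size would go undetected on a test set of this size.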
arXiv Detail & Related papers (2020-10-13T18:00:02Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
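The mutual-information view of probing in the last entry can be sketched as $I(R;T) = H(T) - H(T \mid R)$: the entropy of the linguistic labels minus a conditional-entropy term, upper-bounded by the probe's average cross-entropy on the gold labels. The labels and probe probabilities below are made up for illustration:

```python
import math
from collections import Counter

# Toy gold labels T and the (hypothetical) probability a trained probe
# assigns to each gold label given the representation R.
labels = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]
probe_prob_of_gold = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9]

# H(T): entropy (in bits) of the empirical label distribution.
counts = Counter(labels)
n = len(labels)
h_t = -sum((c / n) * math.log2(c / n) for c in counts.values())

# H(T | R) upper bound: mean negative log-probability under the probe.
h_t_given_r = -sum(math.log2(p) for p in probe_prob_of_gold) / n

# Lower bound on the mutual information I(R; T).
mi_estimate = h_t - h_t_given_r
print(f"H(T)={h_t:.3f} bits, H(T|R)<={h_t_given_r:.3f}, I>={mi_estimate:.3f}")
```

A better probe lowers the cross-entropy term, raising the estimated mutual information between representations and labels.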
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.