Towards Deconfounding the Influence of Subject's Demographic Characteristics in Question Answering
- URL: http://arxiv.org/abs/2104.07571v1
- Date: Thu, 15 Apr 2021 16:26:54 GMT
- Authors: Maharshi Gor, Kellie Webster, and Jordan Boyd-Graber
- Abstract summary: Question Answering tasks are used as benchmarks of general machine intelligence.
Major QA datasets have skewed distributions over gender, profession, and nationality.
We find little evidence that accuracy is lower for people based on gender or nationality.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Question Answering (QA) tasks are used as benchmarks of general machine
intelligence. Therefore, robust QA evaluation is critical, and metrics should
indicate how models will answer any question. However, major QA datasets have
skewed distributions over gender, profession, and nationality. Despite that
skew, models generalize -- we find little evidence that accuracy is lower for
people based on gender or nationality. Instead, there is more variation in
question topic and question ambiguity. Adequately assessing the generalization
of QA systems requires more representative datasets.
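As a concrete illustration of the disaggregated evaluation the abstract argues for, here is a minimal sketch that computes accuracy per demographic group; the record layout and toy data are hypothetical, not an artifact released with the paper:

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute QA accuracy disaggregated by a demographic attribute.

    records: iterable of (predicted_answer, gold_answer, group) tuples,
    where group is e.g. the question subject's gender or nationality.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold, group in records:
        total[group] += 1
        if pred.strip().lower() == gold.strip().lower():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Hypothetical toy records: (prediction, gold answer, subject gender)
records = [
    ("Marie Curie", "Marie Curie", "female"),
    ("Einstein", "Albert Einstein", "male"),
    ("Ada Lovelace", "Ada Lovelace", "female"),
]
print(accuracy_by_group(records))  # {'female': 1.0, 'male': 0.0}
```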
Related papers
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
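Scoring abstention can be reduced to a confusion matrix over answer-versus-abstain decisions. A minimal sketch, assuming a model that reports a confidence and abstains below a threshold (the threshold and input layout are assumptions, not UNK-VQA's protocol):

```python
def abstention_scores(examples, threshold=0.5):
    """Score a model's ability to abstain on unanswerable questions.

    examples: iterable of (confidence, is_answerable) pairs, where the
    model is taken to abstain when confidence < threshold.
    """
    tp = fp = tn = fn = 0
    for confidence, is_answerable in examples:
        abstained = confidence < threshold
        if is_answerable and not abstained:
            tp += 1  # answered an answerable question
        elif is_answerable and abstained:
            fn += 1  # wrongly abstained
        elif not is_answerable and abstained:
            tn += 1  # correctly abstained
        else:
            fp += 1  # answered an unanswerable question
    n = tp + fp + tn + fn
    return {"answer_rate_ok": tp / max(tp + fn, 1),
            "abstain_rate_ok": tn / max(tn + fp, 1),
            "accuracy": (tp + tn) / max(n, 1)}
```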
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
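That intuition can be caricatured in a few lines: if conditioning on a candidate example makes the model prefer a stereotyped option more, the candidate example is likely biased. This is only a toy proxy; the probabilities and their source are assumptions, not BMBI's actual measurement:

```python
def bias_influence(prob_before, prob_after):
    """Toy proxy for the BMBI intuition: how much more a model prefers
    the stereotyped answer option after being conditioned on a
    candidate example. Larger positive values suggest the candidate
    example itself is biased. A simplification, not BMBI's procedure.
    """
    return prob_after - prob_before

def bias_level(shifts):
    """Aggregate per-query shifts into one bias level for the
    candidate example (mean shift over probe queries)."""
    return sum(shifts) / len(shifts)

# Hypothetical probabilities from a multiple-choice QA model on three
# probe queries, before/after conditioning on one candidate example:
shifts = [bias_influence(0.40, 0.65), bias_influence(0.33, 0.41),
          bias_influence(0.50, 0.48)]
print(bias_level(shifts))  # ~0.10 -> candidate looks biased
```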
arXiv Detail & Related papers (2023-10-13T00:49:09Z)
- SQuArE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
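A rough sketch of scoring against multiple positive and negative references, with plain token-overlap F1 standing in for whatever similarity SQuArE actually computes:

```python
def token_f1(a, b):
    """Token-overlap F1 between two strings (a crude similarity)."""
    ta, tb = a.lower().split(), b.lower().split()
    common = sum(min(ta.count(w), tb.count(w)) for w in set(ta))
    if common == 0:
        return 0.0
    p, r = common / len(ta), common / len(tb)
    return 2 * p * r / (p + r)

def square_like_score(candidate, positives, negatives):
    """Score an answer higher when it resembles positive references
    more than negative ones (a simplification of the SQuArE idea)."""
    pos = max(token_f1(candidate, ref) for ref in positives)
    neg = max(token_f1(candidate, ref) for ref in negatives)
    return pos - neg

print(square_like_score("paris", ["paris", "the city of paris"],
                        ["london"]))  # 1.0
```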
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Evaluation of Question Answering Systems: Complexity of judging a natural language [3.4771957347698583]
Question answering (QA) systems are among the most important and rapidly developing research topics in natural language processing (NLP).
This survey attempts to provide a systematic overview of the general framework of QA, QA paradigms, benchmark datasets, and assessment techniques for a quantitative evaluation of QA systems.
arXiv Detail & Related papers (2022-09-10T12:29:04Z)
- BBQ: A Hand-Built Bias Benchmark for Question Answering [25.108222728383236]
It is well documented that NLP models learn social biases present in the world, but little work has been done to show how these biases manifest in actual model outputs for applied tasks like question answering (QA).
We introduce the Bias Benchmark for QA (BBQ), a dataset of author-constructed question sets that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant to U.S. English-speaking contexts.
We find that models rely strongly on stereotypes when the context is ambiguous, meaning that the models' outputs consistently reproduce harmful biases in this setting.
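One common way to turn this into a number is a bias score over ambiguous-context examples: among answers that are not "unknown", how tilted toward the stereotyped target is the model? A simplified sketch, with an assumed input layout:

```python
def ambiguous_bias_score(examples):
    """Among non-'unknown' answers in ambiguous contexts, how tilted
    toward the stereotyped target the model is, scaled to [-1, 1].

    examples: iterable of (is_unknown, matches_stereotype) pairs.
    0 = no tilt; 1 = always stereotyped; -1 = always anti-stereotyped.
    """
    biased = answered = 0
    for is_unknown, matches_stereotype in examples:
        if is_unknown:
            continue  # correct behavior in an ambiguous context
        answered += 1
        biased += 1 if matches_stereotype else 0
    if answered == 0:
        return 0.0
    return 2 * (biased / answered) - 1
```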
arXiv Detail & Related papers (2021-10-15T16:43:46Z)
- NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset [26.782937852417454]
We introduce NOAHQA, a bilingual QA dataset with questions requiring numerical reasoning with compound mathematical expressions.
We evaluate state-of-the-art QA models trained on existing QA datasets against NOAHQA and show that the best of them achieves an exact match score of only 55.5.
We also present a new QA model that generates a reasoning graph, though its score on the reasoning-graph metric still falls far short of human performance.
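For reference, a minimal version of the exact match metric quoted above, using SQuAD-style normalization (NOAHQA's exact normalization rules may differ):

```python
import re
import string

def normalize(s):
    """Lowercase, strip punctuation and articles, collapse whitespace
    (the usual SQuAD-style normalization)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

print(exact_match("The answer is 42.", "answer is 42"))  # 1.0
```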
arXiv Detail & Related papers (2021-09-22T09:17:09Z)
- Unsupervised Evaluation for Question Answering with Transformers [46.16837670041594]
We investigate the hidden representations of questions, answers, and contexts in transformer-based QA architectures.
We observe a consistent pattern in the answer representations, which we show can be used to automatically evaluate whether or not a predicted answer is correct.
We are able to predict whether or not a model's answer is correct with 91.37% accuracy on SQuAD and 80.7% accuracy on SubjQA.
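Reading correctness off hidden representations amounts to training a probe. A minimal sketch with random placeholder features standing in for mean-pooled answer-token states; the feature choice is an assumption, not the exact pattern the paper identifies:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: one pooled hidden state per predicted answer
# span (n_examples x hidden_dim), with label 1 = answer was correct.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))   # stand-in for real hidden states
y = rng.integers(0, 2, size=200)  # stand-in for correctness labels

probe = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
print("probe accuracy:", probe.score(X[150:], y[150:]))
```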
arXiv Detail & Related papers (2020-10-07T07:03:30Z)
- UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
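A simplified version of the debiased comparison, where averaging over subject order cancels positional errors and subtracting scores on the negated question cancels attribute errors; all score values are assumed to come from some QA model, and the exact correction in the paper may differ:

```python
def debiased_preference(s_x1, s_x1_swapped, s_x1_neg, s_x1_neg_swapped):
    """Simplified UNQOVER-style score for how much a model prefers
    subject x1 for a stereotyped attribute.

    s_x1:             score for x1 with subjects in (x1, x2) order
    s_x1_swapped:     score for x1 with subjects in (x2, x1) order
    s_x1_neg:         score for x1 on the negated question
    s_x1_neg_swapped: same, with subjects swapped
    """
    pos = (s_x1 + s_x1_swapped) / 2  # cancel positional bias
    neg = (s_x1_neg + s_x1_neg_swapped) / 2  # cancel negation errors
    return pos - neg
```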
arXiv Detail & Related papers (2020-10-06T01:49:52Z)
- KPQA: A Metric for Generative Question Answering Using Keyphrase Weights [64.54593491919248]
KPQA is a new metric for evaluating the correctness of generative question answering systems.
Our new metric assigns different weights to each token via keyphrase prediction.
We show that our proposed metric has a significantly higher correlation with human judgments than existing metrics.
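Keyphrase weighting can be illustrated with a weighted token F1 in which per-token weights would come from KPQA's learned keyphrase predictor; here they are supplied by a hand-written stand-in:

```python
def weighted_f1(pred_tokens, gold_tokens, weight):
    """Token F1 where each token counts according to weight(token),
    so keyphrase tokens dominate the score.

    weight: callable mapping a token to a non-negative importance,
    standing in for KPQA's learned keyphrase predictor.
    """
    overlap = set(pred_tokens) & set(gold_tokens)
    w_overlap = sum(weight(t) for t in overlap)
    w_pred = sum(weight(t) for t in pred_tokens)
    w_gold = sum(weight(t) for t in gold_tokens)
    if w_overlap == 0:
        return 0.0
    p, r = w_overlap / w_pred, w_overlap / w_gold
    return 2 * p * r / (p + r)

# Hypothetical weights: content words matter, function words barely do.
w = lambda t: 0.1 if t in {"the", "a", "is", "in"} else 1.0
print(weighted_f1("the eiffel tower".split(),
                  "eiffel tower".split(), w))  # ~0.98
```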
arXiv Detail & Related papers (2020-05-01T03:24:36Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5%, marginally improves performance on the reasoning questions in VQA, and yields better attention maps.
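Consistency here can be operationalized as the fraction of correctly answered reasoning questions whose perception sub-questions are also answered correctly; a sketch of that measurement (the paper's exact definition may differ):

```python
def consistency(pairs):
    """Fraction of correctly answered reasoning questions whose
    perception sub-questions are also all answered correctly.

    pairs: iterable of (reasoning_correct, all_subquestions_correct)
    booleans, one pair per reasoning question.
    """
    main_correct = both_correct = 0
    for reasoning_correct, subs_correct in pairs:
        if reasoning_correct:
            main_correct += 1
            both_correct += int(subs_correct)
    return both_correct / max(main_correct, 1)
```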
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.