What Gives the Answer Away? Question Answering Bias Analysis on Video QA
Datasets
- URL: http://arxiv.org/abs/2007.03626v1
- Date: Tue, 7 Jul 2020 17:00:11 GMT
- Title: What Gives the Answer Away? Question Answering Bias Analysis on Video QA
Datasets
- Authors: Jianing Yang, Yuying Zhu, Yongxin Wang, Ruitao Yi, Amir Zadeh,
Louis-Philippe Morency
- Abstract summary: Question answering biases in video QA datasets can mislead multimodal models into overfitting to QA artifacts.
Our study shows that biases can come from annotators and from question type.
We also show empirically that using annotator-non-overlapping train-test splits can reduce QA biases for video QA datasets.
- Score: 40.64071905569975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question answering biases in video QA datasets can mislead multimodal models
into overfitting to QA artifacts and jeopardize their ability to generalize.
Understanding how strong these QA biases are and where they come from helps the
community measure progress more accurately and gives researchers insights for
debugging their models. In this paper, we analyze QA biases in popular video
question answering datasets and discover that pretrained language models can answer
37-48% of questions correctly without using any multimodal context information,
far exceeding the 20% random-guess baseline for 5-choose-1 multiple-choice
questions. Our ablation study shows that biases can come from annotators and from
question type. Specifically, questions written by annotators seen during training are
better predicted by the model, and abstract, reasoning-oriented questions incur more
bias than factual, direct questions. We also show empirically that using
annotator-non-overlapping train-test splits can reduce QA biases for video QA
datasets.
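As a rough illustration of what an annotator-non-overlapping train-test split could look like, the sketch below groups examples by annotator and assigns whole annotators to one side of the split. It is not the authors' procedure: the `annotator_id` field, the function name, and the `test_fraction` and `seed` parameters are all assumptions made for this example.

```python
import random
from collections import defaultdict


def annotator_disjoint_split(examples, test_fraction=0.2, seed=0):
    """Split examples so that no annotator appears in both train and test.

    `examples` is a list of dicts with a hypothetical "annotator_id" key.
    """
    # Group examples by the annotator who wrote them.
    by_annotator = defaultdict(list)
    for ex in examples:
        by_annotator[ex["annotator_id"]].append(ex)

    # Shuffle annotators (not examples) so whole groups move together.
    annotators = sorted(by_annotator)
    random.Random(seed).shuffle(annotators)

    # Fill the test set with whole annotators until it reaches the target size.
    target = test_fraction * len(examples)
    train, test = [], []
    for annotator in annotators:
        bucket = test if len(test) < target else train
        bucket.extend(by_annotator[annotator])
    return train, test


# Toy data: 10 hypothetical annotators with 10 questions each.
examples = [{"annotator_id": f"a{i % 10}", "question": f"q{i}"} for i in range(100)]
train, test = annotator_disjoint_split(examples)
```

Because whole annotators are assigned at once, the realized test size can overshoot `test_fraction` when annotators contribute unequal numbers of questions.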
Related papers
- Mitigating Bias for Question Answering Models by Tracking Bias Influence [84.66462028537475]
We propose BMBI, an approach to mitigate the bias of multiple-choice QA models.
Based on the intuition that a model tends to become more biased if it learns from a biased example, we measure the bias level of a query instance.
We show that our method could be applied to multiple QA formulations across multiple bias categories.
arXiv Detail & Related papers (2023-10-13T00:49:09Z)
- Open-vocabulary Video Question Answering: A New Benchmark for Evaluating
the Generalizability of Video Question Answering Models [15.994664381976984]
We introduce a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models.
In addition, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers.
Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance.
arXiv Detail & Related papers (2023-08-18T07:45:10Z)
- CREPE: Open-Domain Question Answering with False Presuppositions [92.20501870319765]
We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums.
We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections.
We show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct.
arXiv Detail & Related papers (2022-11-30T18:54:49Z)
- NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering
Dataset [26.782937852417454]
We introduce NOAHQA, a bilingual QA dataset with questions requiring numerical reasoning with compound mathematical expressions.
We evaluate state-of-the-art QA models trained on existing QA datasets on NOAHQA and show that the best among them achieves an exact match score of only 55.5.
We also present a new QA model that generates a reasoning graph; on the reasoning graph metric, it still shows a large gap compared with humans.
arXiv Detail & Related papers (2021-09-22T09:17:09Z)
- UnQovering Stereotyping Biases via Underspecified Questions [68.81749777034409]
We present UNQOVER, a framework to probe and quantify biases through underspecified questions.
We show that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors.
We use this metric to analyze four important classes of stereotypes: gender, nationality, ethnicity, and religion.
arXiv Detail & Related papers (2020-10-06T01:49:52Z)
- Selective Question Answering under Domain Shift [90.021577320085]
Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely.
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
arXiv Detail & Related papers (2020-06-16T19:13:21Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5%, also marginally improving performance on the Reasoning questions in VQA, while also displaying better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.