Detrimental Contexts in Open-Domain Question Answering
- URL: http://arxiv.org/abs/2310.18077v1
- Date: Fri, 27 Oct 2023 11:45:16 GMT
- Title: Detrimental Contexts in Open-Domain Question Answering
- Authors: Philhoon Oh and James Thorne
- Abstract summary: We analyze how passages can have a detrimental effect on retrieve-then-read architectures used in question answering.
Our findings demonstrate that model accuracy can be improved by 10% on two popular QA datasets by filtering out detrimental passages.
- Score: 9.059854023578508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For knowledge intensive NLP tasks, it has been widely accepted that accessing
more information is a contributing factor to improvements in the model's
end-to-end performance. However, counter-intuitively, too much context can have
a negative impact on the model when evaluated on common question answering (QA)
datasets. In this paper, we analyze how passages can have a detrimental effect
on retrieve-then-read architectures used in question answering. Our empirical
evidence indicates that the current reader architecture does not fully leverage
the retrieved passages: its performance degrades significantly when it is given
the full set of retrieved passages rather than a subset of them. Our findings
demonstrate that model accuracy can be improved by 10% on two popular QA
datasets by filtering out detrimental passages. Additionally, these outcomes
are attained by utilizing existing retrieval methods without further training
or data. We further highlight the challenges associated with identifying the
detrimental passages. First, even with the correct context, the model can make
an incorrect prediction, posing a challenge in determining which passages are
most influential. Second, evaluation typically relies on lexical matching,
which is not robust to surface variations of correct answers. Despite these
limitations, our experimental results underscore the pivotal role of
identifying and removing these detrimental passages for the context-efficient
retrieve-then-read pipeline. Code and data are available at
https://github.com/xfactlab/emnlp2023-damaging-retrieval
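
The pipeline described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' released code: the retrieve, read, and is_detrimental callables are hypothetical placeholders for an existing retriever, a reader model, and whatever criterion flags a passage as detrimental, and the exact_match helper illustrates the lexical-matching evaluation the abstract notes is not robust to answer variations.

```python
# Minimal sketch of a retrieve-then-read pipeline with passage filtering.
# All helper callables (retrieve, read, is_detrimental) are hypothetical
# placeholders, not the paper's actual code or API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    text: str
    score: float  # relevance score assigned by the retriever


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """Lexical exact-match evaluation. As the abstract notes, this kind of
    string matching is brittle against valid variations of a correct answer."""
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())
    return any(normalize(prediction) == normalize(answer) for answer in gold_answers)


def filtered_retrieve_then_read(
    question: str,
    retrieve: Callable[[str, int], List[Passage]],   # hypothetical retriever
    read: Callable[[str, List[Passage]], str],       # hypothetical reader model
    is_detrimental: Callable[[str, Passage], bool],  # hypothetical filter criterion
    top_k: int = 20,
) -> str:
    """Retrieve top-k passages, drop those flagged as detrimental, and feed
    only the surviving subset to the reader."""
    passages = retrieve(question, top_k)
    kept = [p for p in passages if not is_detrimental(question, p)]
    # Fall back to the unfiltered set if everything was filtered out.
    return read(question, kept if kept else passages)
```

In the paper's setting, filtering operates on the output of existing retrieval methods without further training or data; the filter criterion itself is left abstract in this sketch.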
Related papers
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Answerability in Retrieval-Augmented Open-Domain Question Answering [17.177439885871788]
Open-Domain Question Answering (ODQA) retrieval systems can exhibit sub-optimal behavior, providing text excerpts with varying degrees of irrelevance.
Previous attempts to address this gap have relied on a simplistic approach of pairing questions with random text excerpts.
arXiv Detail & Related papers (2024-03-03T09:55:35Z)
- Mitigating the Impact of False Negatives in Dense Retrieval with Contrastive Confidence Regularization [15.204113965411777]
We propose a novel contrastive confidence regularizer for Noise Contrastive Estimation (NCE) loss.
Our analysis shows that the regularizer helps dense retrieval models be more robust against false negatives with a theoretical guarantee.
arXiv Detail & Related papers (2023-12-30T08:01:57Z)
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information improves model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
- Revisiting text decomposition methods for NLI-based factuality scoring of summaries [9.044665059626958]
We show that fine-grained decomposition is not always a winning strategy for factuality scoring.
We also show that small changes to previously proposed entailment-based scoring methods can result in better performance.
arXiv Detail & Related papers (2022-11-30T09:54:37Z)
- Toward the Understanding of Deep Text Matching Models for Information Retrieval [72.72380690535766]
This paper aims at testing whether existing deep text matching methods satisfy fundamental constraints in information retrieval.
Specifically, four constraints are used in our study, i.e., the term frequency constraint, term discrimination constraint, length normalization constraint, and TF-length constraint.
Experimental results on LETOR 4.0 and MS MARCO show that all the investigated deep text matching methods satisfy these constraints with high probability.
arXiv Detail & Related papers (2021-08-16T13:33:15Z)
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
- Geometry matters: Exploring language examples at the decision boundary [2.7249290070320034]
BERT, CNN, and fastText models are susceptible to word substitutions in high-difficulty examples.
On YelpReviewPolarity we observe a correlation coefficient of -0.4 between resilience to perturbations and the difficulty score.
Our approach is simple, architecture agnostic and can be used to study the fragilities of text classification models.
arXiv Detail & Related papers (2020-10-14T16:26:13Z)