On Reference (In-)Determinacy in Natural Language Inference
- URL: http://arxiv.org/abs/2502.05793v1
- Date: Sun, 09 Feb 2025 06:58:13 GMT
- Title: On Reference (In-)Determinacy in Natural Language Inference
- Authors: Sihao Chen, Chaitanya Malaviya, Alex Fabrikant, Hagai Taitelbaum, Tal Schuster, Senaka Buthpitiya, Dan Roth
- Abstract summary: We revisit the reference determinacy (RD) assumption in the task of natural language inference (NLI).
We observe that current NLI models fail in downstream applications such as fact verification, where the input premise and hypothesis may refer to different contexts.
We introduce RefNLI, a diagnostic benchmark for identifying reference ambiguity in NLI examples.
- Score: 62.904689974282334
- Abstract: We revisit the reference determinacy (RD) assumption in the task of natural language inference (NLI), i.e., the premise and hypothesis are assumed to refer to the same context when human raters annotate a label. While RD is a practical assumption for constructing a new NLI dataset, we observe that current NLI models, which are typically trained solely on hypothesis-premise pairs created with the RD assumption, fail in downstream applications such as fact verification, where the input premise and hypothesis may refer to different contexts. To highlight the impact of this phenomenon in real-world use cases, we introduce RefNLI, a diagnostic benchmark for identifying reference ambiguity in NLI examples. In RefNLI, the premise is retrieved from a knowledge source (i.e., Wikipedia) and does not necessarily refer to the same context as the hypothesis. With RefNLI, we demonstrate that finetuned NLI models and few-shot prompted LLMs both fail to recognize context mismatch, leading to over 80% false contradiction and over 50% entailment predictions. We discover that the existence of reference ambiguity in NLI examples can in part explain the inherent human disagreements in NLI and provide insight into how the RD assumption impacts the NLI dataset creation process.
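This failure mode is easy to probe with off-the-shelf tools. Below is a minimal sketch (ours, not the authors' evaluation code), assuming a generic HuggingFace NLI checkpoint; the checkpoint name and example pair are illustrative stand-ins for RefNLI instances.
```python
# Minimal probe of an off-the-shelf NLI model with a reference-mismatched
# pair. Checkpoint and sentences are illustrative assumptions.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # assumed generic NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# The premise (as if retrieved from Wikipedia) and the hypothesis plausibly
# describe two different people who happen to share a name.
premise = "John Smith was an English explorer who helped found Jamestown in 1607."
hypothesis = "John Smith is a retired basketball coach from Ohio."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
probs = model(**inputs).logits.softmax(dim=-1).squeeze()
for idx, label in model.config.id2label.items():
    print(f"{label}: {probs[idx].item():.3f}")
# Models trained under the RD assumption tend to output CONTRADICTION here
# instead of recognizing that the two sentences refer to different contexts.
```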
Related papers
- Hypothesis-only Biases in Large Language Model-Elicited Natural Language Inference [3.0804372027733202]
We recreate a portion of the Stanford NLI corpus using GPT-4, Llama-2, and Mistral 7B.
We train hypothesis-only classifiers to determine whether LLM-elicited hypotheses contain annotation artifacts.
Our analysis provides empirical evidence that well-attested biases in NLI can persist in LLM-generated data.
arXiv Detail & Related papers (2024-10-11T17:09:22Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can leverage their generative capability to correct even tokens that are missing from the N-best list.
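As an illustration of the general recipe (not the HP benchmark's actual prompts or models), here is a hedged sketch that asks an LLM to repair a transcription from an N-best list; the model name, prompt wording, and example hypotheses are all assumptions.
```python
# Sketch of LLM-based N-best error correction; prompt wording and model
# name are assumptions, not the benchmark's setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

n_best = [
    "i red the book on the plain",
    "i read the book on the plain",
    "i read a book on the plane",
]
prompt = (
    "Below are N-best hypotheses from a speech recognizer for one utterance. "
    "Output the single most likely transcription, fixing errors even if the "
    "correct word appears in none of the hypotheses.\n"
    + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(n_best))
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```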
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- No Strong Feelings One Way or Another: Re-operationalizing Neutrality in Natural Language Inference [6.485890157501745]
Natural Language Inference (NLI) has been a cornerstone task in evaluating language models' inferential reasoning capabilities.
The standard three-way classification scheme used in NLI has well-known shortcomings in evaluating models' ability to capture the nuances of natural human reasoning.
We argue that the operationalization of the neutral label in current NLI datasets has low validity, is interpreted inconsistently, and that at least one important sense of neutrality is often ignored.
arXiv Detail & Related papers (2023-06-16T15:45:08Z)
- AMR4NLI: Interpretable and robust NLI measures from semantic graphs [28.017617759762278]
Natural language inference asks whether a given premise entails a given hypothesis.
We compare semantic structures to represent premise and hypothesis, including sets of contextualized embeddings and semantic graphs.
Our evaluation finds value in both contextualized embeddings and semantic graphs.
arXiv Detail & Related papers (2023-06-01T17:39:40Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by increasing the number and diversity of references.
We conduct experiments to empirically demonstrate that diversifying the expression of the references can significantly enhance the correlation between automatic and human evaluation.
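A minimal sketch of the underlying idea, multi-reference scoring: BLEU already takes the best n-gram matches across references, so adding paraphrased references can only help valid rewordings. The paraphrases below are hand-written stand-ins for the LLM-generated ones Div-Ref would produce.
```python
# Multi-reference BLEU sketch; the extra references are hand-written
# paraphrases standing in for LLM-generated ones.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

hypothesis = "the cat sat on the mat".split()
single_ref = ["the cat is sitting on the mat".split()]
diversified_refs = single_ref + [
    "a cat sat on the mat".split(),
    "the cat sat down on the mat".split(),
]

smooth = SmoothingFunction().method1
print(sentence_bleu(single_ref, hypothesis, smoothing_function=smooth))
print(sentence_bleu(diversified_refs, hypothesis, smoothing_function=smooth))
# BLEU clips n-gram counts against the best-matching reference, so valid
# paraphrases score higher once the reference set is diversified.
```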
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- Semantic Role Labeling Meets Definition Modeling: Using Natural Language to Describe Predicate-Argument Structures [104.32063681736349]
We present an approach to describe predicate-argument structures using natural language definitions instead of discrete labels.
Our experiments and analyses on PropBank-style and FrameNet-style, dependency-based and span-based SRL also demonstrate that a flexible model with an interpretable output does not necessarily come at the expense of performance.
arXiv Detail & Related papers (2022-12-02T11:19:16Z)
- Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation [59.307534363825816]
Negation is poorly captured by current language models, although the extent of this problem is not widely understood.
We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods.
arXiv Detail & Related papers (2022-10-06T23:39:01Z)
- Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters [35.103851212995046]
Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs.
We explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on.
We develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset.
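A simplified sketch of sentence-level aggregation (our illustration of the general idea, not the paper's exact method): score the hypothesis against each document sentence with a sentence-pair NLI model and take the maximum entailment probability.
```python
# Sketch: max-entailment aggregation of a sentence-pair NLI model over a
# document. Checkpoint name is an assumption.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # assumed sentence-pair NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def document_entails(sentences: list[str], hypothesis: str) -> float:
    """Max entailment probability of the hypothesis over document sentences."""
    entail_id = model.config.label2id["ENTAILMENT"]
    best = 0.0
    for sent in sentences:
        inputs = tokenizer(sent, hypothesis, return_tensors="pt", truncation=True)
        probs = model(**inputs).logits.softmax(dim=-1).squeeze()
        best = max(best, probs[entail_id].item())
    return best
```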
arXiv Detail & Related papers (2022-04-15T12:56:39Z)
- Automatically Identifying Semantic Bias in Crowdsourced Natural Language Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets.
Interventions and additional rounds of labeling can then be performed to ameliorate the semantic bias of a dataset's hypothesis distribution.
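A rough sketch of the bias-cluster idea, assuming a sentence-transformers encoder and k-means in place of whatever the paper actually uses; the encoder name, cluster count, and skew threshold are our assumptions.
```python
# Sketch: flag embedding clusters whose gold-label distribution is skewed.
# Encoder, cluster count, and skew threshold are assumptions.
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def find_bias_clusters(hypotheses, labels, n_clusters=50, skew=0.8):
    """Return (cluster_id, dominant_label) pairs for label-skewed clusters."""
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(hypotheses)
    cluster_ids = KMeans(n_clusters=n_clusters).fit_predict(embeddings)
    flagged = []
    for cid in set(cluster_ids):
        counts = Counter(l for l, c in zip(labels, cluster_ids) if c == cid)
        top_label, top_count = counts.most_common(1)[0]
        if top_count / sum(counts.values()) > skew:
            flagged.append((cid, top_label))
    return flagged
```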
arXiv Detail & Related papers (2021-12-16T22:49:01Z)
- Exploring Lexical Irregularities in Hypothesis-Only Models of Natural Language Inference [5.283529004179579]
Natural Language Inference (NLI) or Recognizing Textual Entailment (RTE) is the task of predicting the entailment relation between a pair of sentences.
Models that truly understand entailment should encode both the premise and the hypothesis.
Experiments by Poliak et al. revealed a strong preference of these models towards patterns observed only in the hypothesis.
arXiv Detail & Related papers (2021-01-19T01:08:06Z)
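To make the hypothesis-only probe concrete, here is a compact sketch in the spirit of Poliak et al. (our illustration, not their code): if a classifier that never sees the premise beats the majority-class baseline, the hypotheses leak label information.
```python
# Sketch: a hypothesis-only baseline. Accuracy far above chance signals
# annotation artifacts in the hypotheses.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def hypothesis_only_accuracy(train_hyps, train_labels, test_hyps, test_labels):
    """Train on hypotheses alone and report held-out accuracy."""
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(train_hyps, train_labels)
    return clf.score(test_hyps, test_labels)
# On SNLI, such probes land well above the ~33% three-class chance level.
```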