Let's CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition
- URL: http://arxiv.org/abs/2506.06133v1
- Date: Fri, 06 Jun 2025 14:42:20 GMT
- Title: Let's CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition
- Authors: Tara Azin, Daniel Dumitrescu, Diana Inkpen, Raj Singh
- Abstract summary: We introduce CONFER, a novel dataset designed to evaluate how NLI models process inference in conditional sentences. We assess the performance of four NLI models, including two pre-trained models, to examine their generalization to conditional reasoning. Our findings indicate that NLI models struggle with presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets does not necessarily improve their performance.
- Score: 6.429761894240061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language Inference (NLI) is the task of determining whether a sentence pair represents entailment, contradiction, or a neutral relationship. While NLI models perform well on many inference tasks, their ability to handle fine-grained pragmatic inferences, particularly presupposition in conditionals, remains underexplored. In this study, we introduce CONFER, a novel dataset designed to evaluate how NLI models process inference in conditional sentences. We assess the performance of four NLI models, including two pre-trained models, to examine their generalization to conditional reasoning. Additionally, we evaluate Large Language Models (LLMs), including GPT-4o, LLaMA, Gemma, and DeepSeek-R1, in zero-shot and few-shot prompting settings to analyze their ability to infer presuppositions with and without prior context. Our findings indicate that NLI models struggle with presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets does not necessarily improve their performance.
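To make the task setup concrete, the minimal sketch below probes an off-the-shelf MNLI-trained classifier with a conditional premise and a presupposition-style hypothesis. The checkpoint name and example sentences are illustrative assumptions; this is not the CONFER data or the paper's evaluation code.

```python
# Minimal sketch (assumptions: model choice and example sentences are illustrative,
# not drawn from CONFER). An MNLI-trained classifier is asked whether a conditional
# premise entails the presupposition triggered inside its antecedent.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "roberta-large-mnli"  # any MNLI-finetuned checkpoint would do
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

premise = "If John stopped smoking, his health would improve."
hypothesis = "John used to smoke."  # presupposition of "stopped smoking"

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

# Read the label order from the checkpoint config rather than hardcoding it,
# since NLI checkpoints order entailment/neutral/contradiction differently.
for idx, label in model.config.id2label.items():
    print(f"{label}: {probs[idx].item():.3f}")
```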
Related papers
- Pushing the boundary on Natural Language Inference [49.15148871877941]
Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering and information retrieval.
Despite its importance, current NLI systems rely heavily on learned artifacts and biases, which limits inference quality and real-world applicability.
This work provides a framework for building robust NLI systems without sacrificing quality or real-world applicability.
arXiv Detail & Related papers (2025-04-25T14:20:57Z) - Improving Context-Aware Preference Modeling for Language Models [62.32080105403915]
We consider the two-step preference modeling procedure that first resolves the under-specification by selecting a context, and then evaluates preference with respect to the chosen context.
We contribute context-conditioned preference datasets and experiments that investigate the ability of language models to evaluate context-specific preference.
arXiv Detail & Related papers (2024-07-20T16:05:17Z) - With a Little Push, NLI Models can Robustly and Efficiently Predict Faithfulness [19.79160738554967]
Conditional language models still generate unfaithful output that is not supported by their input.
We show that pure NLI models can outperform more complex metrics when combining task-adaptive data augmentation with robust inference procedures.
arXiv Detail & Related papers (2023-05-26T11:00:04Z) - FOLIO: Natural Language Reasoning with First-Order Logic [147.50480350846726]
We present FOLIO, a human-annotated, logically complex and diverse dataset for reasoning in natural language (NL).
FOLIO consists of 1,430 examples (unique conclusions), each paired with one of 487 sets of premises used to deductively reason for the validity of each conclusion.
For both NL reasoning and NL-FOL translation, we benchmark multiple state-of-the-art language models.
arXiv Detail & Related papers (2022-09-02T06:50:11Z) - Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters [35.103851212995046]
Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs.
We explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on.
We develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset.
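The summary above does not spell out those aggregation methods; the sketch below only illustrates the general idea under stated assumptions: score the hypothesis against each sentence of the document with a sentence-pair NLI model and max-pool the entailment probability. The checkpoint and example text are assumptions, not the paper's method.

```python
# Rough baseline sketch of document-level NLI aggregation (assumption: max-pooling
# per-sentence entailment probabilities; the paper's own aggregation is more involved).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "roberta-large-mnli"  # assumed sentence-pair NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
entail_id = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]

def document_entailment(sentences, hypothesis):
    """Max over per-sentence entailment probabilities for one hypothesis."""
    scores = []
    for sent in sentences:
        inputs = tokenizer(sent, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1).squeeze()
        scores.append(probs[entail_id].item())
    return max(scores)

doc = [
    "The supplier shall keep all customer data confidential.",
    "Either party may terminate the agreement with thirty days written notice.",
]
print(document_entailment(doc, "The agreement can be terminated."))
```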
arXiv Detail & Related papers (2022-04-15T12:56:39Z) - e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [52.918087305406296]
We introduce e-ViL, a benchmark for evaluating explainable vision-language tasks.
We also introduce e-SNLI-VE, the largest existing dataset with NLEs.
We propose a new model that combines UNITER, which learns joint embeddings of images and text, and GPT-2, a pre-trained language model.
arXiv Detail & Related papers (2021-05-08T18:46:33Z) - Exploring Transitivity in Neural NLI Models through Veridicality [39.845425535943534]
We focus on the transitivity of inference relations, a fundamental property for systematically drawing inferences.
A model capturing transitivity can compose basic inference patterns and draw new inferences.
We find that current NLI models do not perform consistently well on transitivity inference tasks.
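As a hypothetical illustration of what such a transitivity check can look like (the probing setup in the paper is based on veridicality and is not reproduced here), the sketch below composes two entailment predictions and flags the case where the composed pair is not predicted as entailment.

```python
# Hypothetical consistency check (nli_predict is an assumed stand-in for any
# sentence-pair NLI model returning "entailment", "neutral" or "contradiction").
def violates_transitivity(nli_predict, a, b, c):
    """True if A entails B and B entails C but the model rejects A entails C."""
    return (nli_predict(a, b) == "entailment"
            and nli_predict(b, c) == "entailment"
            and nli_predict(a, c) != "entailment")

# Toy stand-in predictions for three sentence pairs; a real NLI model would be
# queried here instead.
a = "Ann adopted a small black cat."
b = "Ann adopted a cat."
c = "Ann adopted an animal."
toy = {(a, b): "entailment", (b, c): "entailment", (a, c): "neutral"}
print(violates_transitivity(lambda p, h: toy[(p, h)], a, b, c))  # -> True
```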
arXiv Detail & Related papers (2021-01-26T11:18:35Z) - Exploring Lexical Irregularities in Hypothesis-Only Models of Natural Language Inference [5.283529004179579]
Natural Language Inference (NLI) or Recognizing Textual Entailment (RTE) is the task of predicting the entailment relation between a pair of sentences.
Models that understand entailment should encode both the premise and the hypothesis.
Experiments by Poliak et al. revealed a strong preference of these models towards patterns observed only in the hypothesis.
arXiv Detail & Related papers (2021-01-19T01:08:06Z) - Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z) - Reading Comprehension as Natural Language Inference: A Semantic Analysis [15.624486319943015]
We explore the utility of Natural Language Inference (NLI) for Question Answering (QA).
We transform one of the largest available MRC datasets (RACE) into NLI form and compare the performance of a state-of-the-art model (RoBERTa) on both forms.
We highlight clear categories for which the model performs better when the data is presented in a coherent entailment form, and others for which a structured question-answer concatenation form works better.
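As a hypothetical sketch of the kind of recasting described above (the paper's exact transformation of RACE may differ), a multiple-choice item can be turned into premise-hypothesis pairs by concatenating the question with each answer option, or by filling the option into a cloze-style question.

```python
# Hypothetical RACE-to-NLI recasting: passage -> premise, question + option -> hypothesis.
def race_to_nli(passage, question, options):
    pairs = []
    for option in options:
        if "_" in question:                       # cloze-style question
            hypothesis = question.replace("_", option)
        else:                                     # plain question-answer concatenation
            hypothesis = f"{question} {option}"
        pairs.append({"premise": passage, "hypothesis": hypothesis})
    return pairs

pairs = race_to_nli(
    passage="Tom moved to Paris in 2010 and has lived there ever since.",
    question="Where does Tom live now?",
    options=["Paris", "London", "Rome", "Berlin"],
)
for p in pairs:
    print(p["hypothesis"])
```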
arXiv Detail & Related papers (2020-10-04T22:50:59Z) - Coreferential Reasoning Learning for Language Representation [88.14248323659267]
We present CorefBERT, a novel language representation model that can capture the coreferential relations in context.
The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks.
arXiv Detail & Related papers (2020-04-15T03:57:45Z)