Can NLP Models Correctly Reason Over Contexts that Break the Common
Assumptions?
- URL: http://arxiv.org/abs/2305.12096v1
- Date: Sat, 20 May 2023 05:20:37 GMT
- Title: Can NLP Models Correctly Reason Over Contexts that Break the Common
Assumptions?
- Authors: Neeraj Varshney, Mihir Parmar, Nisarg Patel, Divij Handa, Sayantan
Sarkar, Man Luo, Chitta Baral
- Abstract summary: We investigate the ability of NLP models to correctly reason over contexts that break the common assumptions.
We show that, while doing fairly well on contexts that follow the common assumptions, the models struggle to correctly reason over contexts that break those assumptions.
Specifically, the performance gap is as high as 20 absolute percentage points.
- Score: 14.991565484636745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training on large corpora of text enables language models to
acquire a vast amount of factual and commonsense knowledge, which allows them
to achieve remarkable performance on a variety of language understanding
tasks. They typically acquire this knowledge by learning from the pre-training
text and capturing certain patterns in it. However, real-world settings often
present scenarios that do not abide by these patterns, i.e., scenarios that
break the common assumptions. Can state-of-the-art NLP models correctly reason
over the contexts of such scenarios?
Addressing the above question, in this paper, we investigate the ability of
models to correctly reason over contexts that break the common assumptions. To
this end, we first systematically create evaluation data in which each data
instance consists of (a) a common assumption, (b) a context that follows the
assumption, (c) a context that breaks the assumption, and (d) questions based
on the contexts. Then, through evaluations on multiple models including GPT-3
and Flan T5, we show that while doing fairly well on contexts that follow the
common assumptions, the models struggle to correctly reason over contexts that
break those assumptions. Specifically, the performance gap is as high as 20
absolute percentage points. Furthermore, we thoroughly analyze these results, revealing
several interesting findings. We believe our work and findings will encourage
and facilitate further research in developing more robust models that can also
reliably reason over contexts that break the common assumptions. Data is
available at https://github.com/nrjvarshney/break_the_common_assumptions.
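
To make the evaluation setup concrete, the following is a minimal Python sketch of the (a)-(d) instance structure described above, together with the follow-versus-break accuracy gap the abstract reports (up to 20 absolute percentage points). The field names, the Instance class, and the predict callable are illustrative assumptions, not the paper's actual schema or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Instance:
    # Mirrors the (a)-(d) structure from the abstract; field names are assumed.
    assumption: str       # (a) a common assumption
    follow_context: str   # (b) a context that follows the assumption
    break_context: str    # (c) a context that breaks the assumption
    question: str         # (d) a question based on the contexts
    follow_answer: str    # gold answer under the follow context
    break_answer: str     # gold answer under the break context

def accuracy_gap(
    instances: Iterable[Instance],
    predict: Callable[[str, str], str],  # hypothetical (context, question) -> answer model
) -> float:
    """Absolute gap in percentage points between follow- and break-context accuracy."""
    items = list(instances)
    follow_correct = sum(
        predict(x.follow_context, x.question) == x.follow_answer for x in items
    )
    break_correct = sum(
        predict(x.break_context, x.question) == x.break_answer for x in items
    )
    return 100.0 * (follow_correct - break_correct) / max(len(items), 1)
```

Under this reading, a model that reasons equally well over both context types would show a gap near zero, whereas the abstract reports gaps of up to 20 points for the evaluated models, including GPT-3 and Flan T5.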
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- QUITE: Quantifying Uncertainty in Natural Language Text in Bayesian Reasoning Scenarios [15.193544498311603]
We present QUITE, a dataset of real-world Bayesian reasoning scenarios with categorical random variables and complex relationships.
We conduct an extensive set of experiments, finding that logic-based models outperform out-of-the-box large language models on all reasoning types.
Our results provide evidence that neuro-symbolic models are a promising direction for improving complex reasoning.
arXiv Detail & Related papers (2024-10-14T12:44:59Z)
- Detecting Multimodal Situations with Insufficient Context and Abstaining from Baseless Predictions [75.45274978665684]
Vision-Language Understanding (VLU) benchmarks contain samples where answers rely on assumptions unsupported by the provided context.
We collect contextual data for each sample whenever available and train a context selection module to facilitate evidence-based model predictions.
We develop a general-purpose Context-AwaRe Abstention detector to identify samples lacking sufficient context and enhance model accuracy.
arXiv Detail & Related papers (2024-05-18T02:21:32Z)
- Evaluating Consistency and Reasoning Capabilities of Large Language Models [0.0]
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance.
Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate.
This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs.
arXiv Detail & Related papers (2024-04-25T10:03:14Z)
- DAIL: Data Augmentation for In-Context Learning via Self-Paraphrase [37.68804898063595]
In-Context Learning (ICL) combined with pre-trained large language models has achieved promising results on various NLP tasks.
We propose Data Augmentation for In-Context Learning (DAIL).
arXiv Detail & Related papers (2023-11-06T18:12:55Z)
- On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
We further refine the robustness metric: a model is judged to be robust if its performance is consistently accurate across the cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Large Language Models with Controllable Working Memory [64.71038763708161]
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP).
What further sets these models apart is the massive amounts of world knowledge they internalize during pretraining.
How the model's world knowledge interacts with the factual information presented in the context remains underexplored.
arXiv Detail & Related papers (2022-11-09T18:58:29Z)
- Context vs Target Word: Quantifying Biases in Lexical Semantic Datasets [18.754562380068815]
State-of-the-art contextualized models such as BERT use tasks such as WiC and WSD to evaluate their word-in-context representations.
This study presents the first quantitative analysis (using probing baselines) on the context-word interaction being tested in major contextual lexical semantic tasks.
arXiv Detail & Related papers (2021-12-13T15:37:05Z)
- Toward the Understanding of Deep Text Matching Models for Information Retrieval [72.72380690535766]
This paper aims at testing whether existing deep text matching methods satisfy fundamental constraints in information retrieval.
Specifically, four constraints are examined in our study, i.e., the term frequency constraint, the term discrimination constraint, the length normalization constraints, and the TF-length constraint.
Experimental results on LETOR 4.0 and MS MARCO show that all the investigated deep text matching methods satisfy the above constraints with high probability.
arXiv Detail & Related papers (2021-08-16T13:33:15Z)