Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
- URL: http://arxiv.org/abs/2305.14785v2
- Date: Thu, 11 Apr 2024 11:16:45 GMT
- Title: Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds
- Authors: Victoria Basmov, Yoav Goldberg, Reut Tsarfaty
- Abstract summary: We evaluate LLMs' language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
- Score: 59.71218039095155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We evaluate LLMs' language understanding capacities on simple inference tasks that most humans find trivial. Specifically, we target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. We design evaluation sets for these tasks and conduct experiments in both zero-shot and chain-of-thought setups, and with multiple prompts and LLMs. The models exhibit moderate to low performance on these evaluation sets. Subsequent experiments show that embedding the premise in syntactic constructions that should preserve the entailment relations (presupposition triggers) or change them (non-factives), further confuses the models, causing them to either under-predict or over-predict certain entailment labels regardless of the true relation, and often disregarding the nature of the embedding context. Overall these results suggest that, despite LLMs' celebrated language understanding capacity, even the strongest models have blindspots with respect to certain types of entailments, and certain information-packaging structures act as "blinds" overshadowing the semantics of the embedded premise.
Related papers
- Unveiling the Capabilities of Large Language Models in Detecting Offensive Language with Annotation Disagreement [22.992484902761994]
This study systematically evaluates the performance of multiple Large Language Models (LLMs) in detecting offensive language.
We analyze binary classification accuracy, examine the relationship between model confidence and human disagreement, and explore how disagreement samples influence model decision-making.
arXiv Detail & Related papers (2025-02-10T07:14:26Z)
- Assessing Language Comprehension in Large Language Models Using Construction Grammar [3.0906699069248806]
Construction Grammar (CxG) provides insights into the meaning captured by linguistic elements known as constructions (Cxns).
These datasets are carefully constructed to include examples which are unlikely to appear in pre-training data, yet intuitive and easy for humans to understand.
Our experiments focus on downstream natural language inference and reasoning tasks by comparing LLMs' understanding of the underlying meanings communicated through 8 unique Cxns with that of humans.
arXiv Detail & Related papers (2025-01-08T18:15:10Z)
- Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs).
In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt.
Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z)
- Traffic Light or Light Traffic? Investigating Phrasal Semantics in Large Language Models [41.233879429714925]
This study critically examines the capacity of API-based large language models to comprehend phrase semantics.
We assess the performance of LLMs in executing phrase semantic reasoning tasks guided by natural language instructions.
We conduct detailed error analyses to interpret the limitations faced by LLMs in comprehending phrase semantics.
arXiv Detail & Related papers (2024-10-03T08:44:17Z)
- Uncertainty Quantification for In-Context Learning of Large Language Models [52.891205009620364]
In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs).
We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties.
The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion.
arXiv Detail & Related papers (2024-02-15T18:46:24Z)
- Explanation-aware Soft Ensemble Empowers Large Language Model In-context Learning [50.00090601424348]
Large language models (LLMs) have shown remarkable capabilities in various natural language understanding tasks.
We propose EASE, an Explanation-Aware Soft Ensemble framework to empower in-context learning with LLMs.
arXiv Detail & Related papers (2023-11-13T06:13:38Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.