Negation: A Pink Elephant in the Large Language Models' Room?
- URL: http://arxiv.org/abs/2503.22395v1
- Date: Fri, 28 Mar 2025 13:04:41 GMT
- Title: Negation: A Pink Elephant in the Large Language Models' Room?
- Authors: Tereza Vrabcová, Marek Kadlčík, Petr Sojka, Michal Štefánik, Michal Spiegel
- Abstract summary: Negations are key to determining sentence meaning, making them essential for logical reasoning. We investigate how model size and language impact a model's ability to handle negation correctly by evaluating popular language models. Our datasets can facilitate further research and improvements of language model reasoning in multilingual settings.
- Score: 2.8078480738404
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Negations are key to determining sentence meaning, making them essential for logical reasoning. Despite their importance, negations pose a substantial challenge for large language models (LLMs) and remain underexplored. We construct two multilingual natural language inference (NLI) datasets with paired examples differing in negation. We investigate how model size and language impact a model's ability to handle negation correctly by evaluating popular LLMs. Contrary to previous work, we show that increasing model size consistently improves the models' ability to handle negations. Furthermore, we find that both the models' reasoning accuracy and robustness to negation are language-dependent, and that the length and explicitness of the premise have a greater impact on robustness than language. Our datasets can facilitate further research and improvements of language model reasoning in multilingual settings.
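To make the evaluation protocol concrete, the sketch below shows one way to score an NLI model on paired examples that differ only in negation. The field names, label set, the classify_nli interface, and the pair-level robustness metric (both members of a pair answered correctly) are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of paired-negation NLI evaluation, assuming a black-box
# classify_nli(premise, hypothesis) -> label call to the model under test.
from dataclasses import dataclass
from typing import Callable, List

LABELS = ("entailment", "contradiction", "neutral")


@dataclass
class PairedExample:
    premise: str
    hypothesis: str
    label: str            # gold label for the original example
    neg_hypothesis: str    # hypothesis with negation inserted or removed
    neg_label: str         # gold label after the negation change


def evaluate_pairs(pairs: List[PairedExample],
                   classify_nli: Callable[[str, str], str]) -> dict:
    """Accuracy on originals, on negated variants, and pair-level robustness."""
    orig_ok = neg_ok = both_ok = 0
    for ex in pairs:
        o = classify_nli(ex.premise, ex.hypothesis) == ex.label
        n = classify_nli(ex.premise, ex.neg_hypothesis) == ex.neg_label
        orig_ok += o
        neg_ok += n
        both_ok += o and n
    total = len(pairs)
    return {
        "accuracy_original": orig_ok / total,
        "accuracy_negated": neg_ok / total,
        "robustness_both_correct": both_ok / total,
    }
```

With such pairs, the per-variant accuracies and the both-correct rate can be reported separately per language, which is the kind of breakdown the abstract describes.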
Related papers
- ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models [75.05436691700572]
We introduce ExpliCa, a new dataset for evaluating Large Language Models (LLMs) in explicit causal reasoning. We tested seven commercial and open-source LLMs on ExpliCa through prompting and perplexity-based metrics. Surprisingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events.
arXiv Detail & Related papers (2025-02-21T14:23:14Z)
- This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models [4.017326849033009]
We try to clarify the reasons for the sub-optimal performance of large language models in understanding negation.
We introduce a large semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge.
We have used our dataset with the largest available open LLMs in a zero-shot approach to grasp their generalization and inference capability.
arXiv Detail & Related papers (2023-10-24T15:38:21Z)
- Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers [0.6091702876917281]
We focus on 'few'-type quantifiers, as in 'few children like toys', which might pose a particular challenge for language models.
We present 960 English sentence stimuli from two human neurolinguistic experiments to 22 autoregressive transformer models of differing sizes.
arXiv Detail & Related papers (2022-12-16T20:01:22Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data.
This may cause the models to grasp cultural values including moral judgments from the high-resource languages.
The lack of data in certain languages can also lead to developing random and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z)
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
- Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences (a hedged loss sketch follows this list).
We reduce the mean top1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
- Detecting and Exorcising Statistical Demons from Language Models with Anti-Models of Negative Data [13.392212395386933]
We find that within a model family, as the number of parameters, training epochs, and data set size increase, so does a model's ability to generalize to negative n-gram data.
We propose a form of inductive bias that attenuates such undesirable signals with negative data distributions automatically learned from positive data.
arXiv Detail & Related papers (2020-10-22T16:45:32Z)
- Recurrent Neural Network Language Models Always Learn English-Like Relative Clause Attachment [17.995905582226463]
We compare model performance in English and Spanish to show that non-linguistic biases in RNN LMs advantageously overlap with syntactic structure in English but not Spanish.
English models may appear to acquire human-like syntactic preferences, while models trained on Spanish fail to acquire comparable human-like preferences.
arXiv Detail & Related papers (2020-05-01T01:21:47Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
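As a companion to the unlikelihood objective mentioned in the "Understanding by Understanding Not" entry above, here is a minimal sketch of how a likelihood term on ordinary sentences might be combined with an unlikelihood term on negated ones. The weighting, masking, and batching details are assumptions for illustration, not that paper's exact formulation.

```python
# Minimal sketch: mix a likelihood loss with an unlikelihood loss on negated
# sentences (PyTorch). Hyperparameters and masking are illustrative only.
import torch
import torch.nn.functional as F


def negation_aware_loss(logits: torch.Tensor,
                        targets: torch.Tensor,
                        is_negated: torch.Tensor,
                        alpha: float = 1.0,
                        ignore_index: int = -100) -> torch.Tensor:
    """logits: (B, T, V); targets: (B, T); is_negated: (B,) bool."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability of each gold token; ignore_index positions are masked out.
    tok_logp = log_probs.gather(-1, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    mask = (targets != ignore_index).float()
    denom = mask.sum(-1).clamp(min=1.0)

    # Likelihood term: ordinary per-sentence negative log-likelihood.
    nll = -(tok_logp * mask).sum(-1) / denom

    # Unlikelihood term: penalize probability mass the model still assigns to
    # the affirmative continuation when the input sentence is negated.
    one_minus_p = (1.0 - tok_logp.exp()).clamp(min=1e-6)
    ul = -(one_minus_p.log() * mask).sum(-1) / denom

    per_sentence = torch.where(is_negated, alpha * ul, nll)
    return per_sentence.mean()
```

The key design choice is that negated inputs contribute only the unlikelihood term, so the model is explicitly discouraged from completing a negated sentence as if the negation were absent.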