Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models
- URL: http://arxiv.org/abs/2411.14103v1
- Date: Thu, 21 Nov 2024 13:09:36 GMT
- Title: Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models
- Authors: Lovish Madaan, David Esiobu, Pontus Stenetorp, Barbara Plank, Dieuwke Hupkes
- Abstract summary: In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model's ability to perform natural language inference (NLI) tasks.
This paper focuses on five different NLI benchmarks across six models of different scales.
We investigate whether they are able to discriminate between models of different sizes and quality, and how their accuracies develop during training.
- Score: 36.983534612895156
- Abstract: In the recent past, a popular way of evaluating natural language understanding (NLU) was to consider a model's ability to perform natural language inference (NLI) tasks. In this paper, we investigate whether NLI tasks, which are rarely used for LLM evaluation, can still be informative for evaluating LLMs. Focusing on five different NLI benchmarks across six models of different scales, we investigate whether they are able to discriminate models of different sizes and quality and how their accuracies develop during training. Furthermore, we investigate the extent to which the softmax distributions of models align with human distributions in cases where statements are ambiguous or vague. Overall, our results paint a positive picture for the NLI tasks: we find that they are able to discriminate well between models at various stages of training, yet are not (all) saturated. Furthermore, we find that while the similarity of model distributions with human label distributions increases with scale, it is still much higher than the similarity between two populations of humans, making it a potentially interesting statistic to consider.
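As a concrete illustration of the distribution comparison described above, the sketch below measures how closely a model's softmax over the three NLI labels matches a human label distribution for a single ambiguous item, using Jensen-Shannon distance. The metric choice, vote counts, and logits are illustrative assumptions, not the paper's exact protocol.

```python
# Illustrative sketch, not the paper's protocol: compare a model's softmax over
# the three NLI labels with a human label distribution for one ambiguous item,
# using Jensen-Shannon distance as an assumed similarity statistic.
import numpy as np
from scipy.spatial.distance import jensenshannon

human_votes = np.array([54, 38, 8])          # hypothetical entailment / neutral / contradiction votes
human_dist = human_votes / human_votes.sum()

model_logits = np.array([2.1, 1.8, -0.5])    # hypothetical logits for the same three labels
model_dist = np.exp(model_logits) / np.exp(model_logits).sum()   # softmax

# Smaller distance = closer alignment with the human label distribution.
print(f"JS distance to human labels: {jensenshannon(human_dist, model_dist, base=2):.3f}")
```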
Related papers
- Bias Similarity Across Large Language Models [32.0365189539138]
Bias in machine learning models has been a chronic problem.
We take a comprehensive look at ten open- and closed-source Large Language Models.
We measure functional similarity to understand how biases manifest across models.
arXiv Detail & Related papers (2024-10-15T19:21:14Z) - What Languages are Easy to Language-Model? A Perspective from Learning Probabilistic Regular Languages [78.1866280652834]
Large language models (LMs) are distributions over strings.
We investigate the learnability of regular LMs (RLMs) by RNN and Transformer LMs.
We find that the complexity of the RLM's rank is a strong and significant predictor of learnability for both RNNs and Transformers.
arXiv Detail & Related papers (2024-06-06T17:34:24Z) - Language Models can Evaluate Themselves via Probability Discrepancy [38.54454263880133]
We propose ProbDiff, a new self-evaluation method for assessing the efficacy of various Large Language Models (LLMs).
It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions.
Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4.
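Read literally, such a discrepancy could be the gap between the log-probability the model assigns to its initial answer and to a revised one. The sketch below assumes that reading, with a placeholder model and a hand-written revision; it is not the paper's exact ProbDiff formulation.

```python
# Hedged sketch of one reading of the abstract: score the log-probability the
# LLM assigns to its initial answer vs. a revised answer and take the gap.
# Model choice, the revision, and scoring details are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder causal LM
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def response_logprob(prompt: str, response: str) -> float:
    """Sum of token log-probs of `response` given `prompt` (assumes the prompt
    tokenization is a prefix of the full tokenization, true for simple cases)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # next-token predictions
    targets = ids[0, 1:]
    idx = torch.arange(prompt_len - 1, targets.shape[0])    # positions of response tokens
    return log_probs[idx, targets[idx]].sum().item()

prompt = "Q: What is the capital of France?\nA:"
initial = " Paris is the capital of France."
revised = " The capital of France is Paris."    # stands in for a model-generated revision

print("probability discrepancy (log space):",
      response_logprob(prompt, initial) - response_logprob(prompt, revised))
```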
arXiv Detail & Related papers (2024-05-17T03:50:28Z) - Unraveling the Dominance of Large Language Models Over Transformer Models for Bangla Natural Language Inference: A Comprehensive Study [0.0]
Natural Language Inference (NLI) is a cornerstone of Natural Language Processing (NLP).
This study addresses the underexplored area of evaluating Large Language Models (LLMs) in low-resourced languages like Bengali.
arXiv Detail & Related papers (2024-05-05T13:57:05Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large
Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study on different datasets shows CoAnnotating to be an effective means of allocating work, with up to a 21% performance improvement over a random baseline.
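A minimal sketch of the uncertainty-guided idea, under the assumption that the LLM's uncertainty is summarized by the entropy of its label distribution and that high-entropy items are routed to humans; the threshold and distributions below are illustrative, not the paper's recipe.

```python
# Hedged sketch: items where the LLM's label distribution is high-entropy go to
# human annotators, the rest are auto-annotated by the LLM.
import math

def label_entropy(label_probs):
    """Shannon entropy (bits) of an LLM's distribution over candidate labels."""
    return -sum(p * math.log2(p) for p in label_probs if p > 0)

# Hypothetical per-item label distributions (e.g., from repeated LLM sampling).
items = {
    "ex1": [0.95, 0.03, 0.02],   # confident -> LLM annotates
    "ex2": [0.40, 0.35, 0.25],   # uncertain -> human annotates
}

THRESHOLD = 0.9  # bits; illustrative cut-off
for item_id, probs in items.items():
    route = "human" if label_entropy(probs) > THRESHOLD else "LLM"
    print(item_id, f"entropy={label_entropy(probs):.2f}", "->", route)
```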
arXiv Detail & Related papers (2023-10-24T08:56:49Z) - Can Large Language Models Capture Dissenting Human Voices? [7.668954669688971]
Large language models (LLMs) have shown impressive achievements in solving a broad range of tasks.
We evaluate the performance and alignment of LLM distribution with humans using two different techniques.
We show that LLMs exhibit limited ability in solving NLI tasks and simultaneously fail to capture the human disagreement distribution.
arXiv Detail & Related papers (2023-05-23T07:55:34Z) - On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of large pretrained language models in semantic parsing tasks with in-context learning.
We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z) - Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
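The sketch below illustrates the controlled-evaluation idea with a toy stand-in: a known unigram "teacher" plays the role of the artificial language, so every event's true probability is available, and a count-based "student" estimated from samples shows how distortion concentrates on the rare tail. The teacher, student, and thresholds are assumptions for illustration, not the paper's setup.

```python
# Toy controlled evaluation: the teacher distribution is fully known, so the
# student's error can be measured separately on rare vs. frequent events.
import numpy as np

rng = np.random.default_rng(0)
vocab = np.arange(1000)
true_p = np.ones(1000)
true_p[0] = 500.0                                  # one frequent symbol, long tail of rare ones
true_p /= true_p.sum()

sample = rng.choice(vocab, size=20_000, p=true_p)  # "training data" drawn from the teacher
counts = np.bincount(sample, minlength=1000)
est_p = (counts + 0.01) / (counts.sum() + 0.01 * 1000)   # smoothed student estimate

rare = true_p < 1e-3
print("mean |log-ratio| on rare events:    ",
      np.abs(np.log(est_p[rare] / true_p[rare])).mean())
print("mean |log-ratio| on frequent events:",
      np.abs(np.log(est_p[~rare] / true_p[~rare])).mean())
```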
arXiv Detail & Related papers (2022-03-24T01:09:46Z) - An Application of Pseudo-Log-Likelihoods to Natural Language Scoring [5.382454613390483]
A language model with relatively few parameters and training steps can outperform larger models on a recent large data set.
We produce some absolute state-of-the-art results for common sense reasoning in binary choice tasks.
We argue that robustness of the smaller model ought to be understood in terms of compositionality.
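For reference, the standard pseudo-log-likelihood (PLL) recipe masks each token in turn and sums the masked LM's log-probability of the original token; the paper's exact scoring variant may differ. A minimal sketch:

```python
# Standard PLL scoring with a masked LM: mask each token, sum the log-prob of
# the original token. Higher PLL = the option the model finds more plausible.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, ids.shape[0] - 1):             # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Binary-choice scoring: pick the candidate with the higher PLL.
print(pseudo_log_likelihood("The cat sat on the mat."))
print(pseudo_log_likelihood("The cat sat on the sky."))
```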
arXiv Detail & Related papers (2022-01-23T22:00:54Z) - Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences arising from its use.