Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function
- URL: http://arxiv.org/abs/2406.01382v1
- Date: Mon, 3 Jun 2024 14:45:21 GMT
- Title: Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function
- Authors: Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan
- Abstract summary: The diversity of uses that makes large language models (LLMs) impressive also makes them hard to evaluate.
We consider a setting where decisions about where to deploy an LLM are made by people, based on their beliefs about where it will perform well.
We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks.
- Score: 3.7078759896522953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.
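To make the abstract's claim that the human generalization function "can be predicted using NLP methods" concrete, here is a minimal sketch. The data fields, features, and classifier are illustrative assumptions, not the authors' pipeline: given a question a person saw an LLM answer (and whether the LLM got it right), predict whether that person expects the LLM to succeed on a new question.

```python
# Illustrative sketch only: a simple predictor of the human generalization
# function. Field names and features are hypothetical, not the paper's setup.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Each record: (question the person saw the LLM answer, whether the LLM was
# correct on it, new question, whether the person then expected success).
survey = [
    ("What is 12 * 7?",            1, "What is 9 * 8?",               1),
    ("What is 12 * 7?",            0, "What is 9 * 8?",               0),
    ("Who wrote Hamlet?",          1, "Prove Fermat's Last Theorem.", 0),
    ("Name the capital of France", 1, "Name the capital of Peru",     1),
]
seen_q  = [r[0] for r in survey]
seen_ok = np.array([[r[1]] for r in survey], dtype=float)
new_q   = [r[2] for r in survey]
labels  = [r[3] for r in survey]

# Represent the (seen question, new question) pair with shared TF-IDF features
# plus the observed correctness flag, then fit a linear classifier.
vec = TfidfVectorizer().fit(seen_q + new_q)
X = hstack([vec.transform(seen_q), vec.transform(new_q), csr_matrix(seen_ok)])
clf = LogisticRegression().fit(X, labels)

# Predicted probability that a person generalizes "the LLM will get this right".
x_new = hstack([vec.transform(["What is 12 * 7?"]),
                vec.transform(["What is 15 * 6?"]),
                csr_matrix([[1.0]])])
print(clf.predict_proba(x_new)[0, 1])
```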
Related papers
- Large Language Models Assume People are More Rational than We Really are [10.857040292234984]
For AI systems to communicate effectively with people, they must understand how we make decisions.
Previous empirical evidence seems to suggest that the implicit models of human decision-making inside LLMs are accurate.
We find that this is not the case when LLMs both simulate and predict people's choices.
arXiv Detail & Related papers (2024-06-24T18:15:27Z)
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead.
We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z)
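A hedged sketch of the correctness-probe idea from the entry above. The hidden features, probe architecture, and training loop are stand-ins chosen for illustration, not the paper's recipe: label (question, answer) pairs as correct or incorrect, fit a small probe on model features, and read its probability as the confidence estimate.

```python
# Sketch of a correctness probe as an uncertainty estimator. The "hidden
# features" here are random stand-ins for real LLM representations; the probe
# architecture and training loop are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 512, 64
features = torch.randn(n, d)               # stand-in for hidden states of (q, a) pairs
is_correct = (features[:, 0] > 0).float()  # stand-in for graded correctness labels

probe = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):                        # fit the probe on correct/incorrect labels
    opt.zero_grad()
    loss = loss_fn(probe(features).squeeze(-1), is_correct)
    loss.backward()
    opt.step()

# The probe's sigmoid output is the uncertainty estimate for a new answer.
confidence = torch.sigmoid(probe(torch.randn(1, d))).item()
print(f"estimated P(answer is correct) = {confidence:.2f}")
```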
- Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define an LLM's self-bias: the tendency to favor its own generations (a toy version of this measurement is sketched after this entry).
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
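One simple way to operationalize the self-bias defined above, sketched with placeholder numbers rather than the paper's estimator: measure how much a model's self-assigned quality scores exceed an external quality measure, before and after self-refinement.

```python
# Illustrative self-bias measurement: mean gap between a model's self-evaluation
# scores and an external quality metric for the same outputs. Scores are
# placeholder numbers, not results from the paper.
def self_bias(self_scores, external_scores):
    """Average amount by which the model over-rates its own generations."""
    gaps = [s - e for s, e in zip(self_scores, external_scores)]
    return sum(gaps) / len(gaps)

# Before self-refinement vs. after: if self-scores rise faster than external
# quality, the bias is amplified by the self-refinement loop.
before = self_bias(self_scores=[0.70, 0.65, 0.80], external_scores=[0.62, 0.60, 0.75])
after  = self_bias(self_scores=[0.85, 0.82, 0.90], external_scores=[0.63, 0.61, 0.76])
print(f"self-bias before refinement: {before:.3f}, after: {after:.3f}")
```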
- KTO: Model Alignment as Prospect Theoretic Optimization [67.44320255397506]
Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner.
We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases.
We propose a human-aware loss (HALO) that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences; a rough sketch of a loss in this spirit follows this entry.
arXiv Detail & Related papers (2024-02-02T10:53:36Z)
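A rough sketch of a prospect-theory-flavored, KTO-style objective written from the description above; the reference point, weights, and hyperparameters are assumptions, not the paper's exact loss: desirable generations are valued through a saturating function of their implicit reward relative to a reference point, and undesirable ones through its mirror image.

```python
# Illustrative KTO-style loss on implicit rewards r = log pi_theta(y|x) - log pi_ref(y|x).
# The reference point z0 and the loss shape follow the general prospect-theory idea
# described above; the details here are assumptions, not the paper's exact objective.
import torch

def kto_style_loss(logp_policy, logp_ref, desirable, beta=0.1):
    reward = logp_policy - logp_ref                  # implicit reward per example
    z0 = reward.detach().mean().clamp(min=0.0)       # crude reference point (assumed)
    value = torch.where(
        desirable,
        torch.sigmoid(beta * (reward - z0)),         # gains saturate above the reference
        torch.sigmoid(beta * (z0 - reward)),         # losses mirror the gain curve
    )
    return (1.0 - value).mean()                      # push value toward 1 for all examples

# Toy usage with made-up log-probabilities for four generations.
logp_policy = torch.tensor([-12.0, -15.0, -9.0, -20.0], requires_grad=True)
logp_ref    = torch.tensor([-13.0, -14.0, -10.0, -18.0])
desirable   = torch.tensor([True, False, True, False])
loss = kto_style_loss(logp_policy, logp_ref, desirable)
loss.backward()
print(loss.item())
```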
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations [62.61495090463084]
Large language models (LLMs) are trained to imitate humans to explain human decisions.
We evaluate whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals.
We found that LLMs' explanations have low precision and that precision does not correlate with plausibility (a toy precision computation is sketched below).
arXiv Detail & Related papers (2023-07-17T17:41:47Z)
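A small sketch of the precision measurement described above, under an assumed interface: `model`, `simulate`, and the counterfactual inputs below are placeholders for an LLM, a reader of its explanation, and perturbed inputs; precision is the fraction of counterfactuals where the explanation lets the reader predict the model's actual output.

```python
# Illustrative counterfactual-simulatability precision. `model`, `simulate`, and
# the counterfactuals are placeholders standing in for an LLM, a (simulated)
# reader of the explanation, and perturbed inputs.
def precision(model, simulate, explanation, counterfactuals):
    """Fraction of counterfactuals where the explanation lets the simulator
    correctly predict the model's actual output."""
    hits = sum(simulate(explanation, x) == model(x) for x in counterfactuals)
    return hits / len(counterfactuals)

# Toy example: the model answers "yes" only for even numbers, but its
# explanation ("I always answer yes") misleads the simulator.
model = lambda x: "yes" if x % 2 == 0 else "no"
simulate = lambda explanation, x: "yes"          # simulator trusts the explanation
explanation = "I always answer yes."
print(precision(model, simulate, explanation, counterfactuals=[1, 2, 3, 4]))  # 0.5
```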
- The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python [34.13276581200455]
Large Language Models (LLMs) have successfully been applied to code generation tasks.
We show that LLMs fail to properly generate correct Python code when default function names are swapped.
Some of them even become more confident in their incorrect predictions as the model size increases.
arXiv Detail & Related papers (2023-05-24T18:54:39Z)
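To make the failure mode above concrete, here is a toy identifier-swap snippet in the spirit of the paper (an illustration, not an item from the benchmark): the built-ins `len` and `print` are swapped before use, so a completion that treats them as usual produces incorrect code.

```python
# Toy identifier-swap probe in the spirit of the paper (an illustration, not an
# item from the benchmark). After this line, `print` computes lengths and `len`
# writes to stdout.
len, print = print, len

def report_lengths(words):
    # A completion that ignores the swap would call print(...) to display and
    # len(...) to measure; under the swap, the correct usage is reversed.
    for w in words:
        len(print(w))   # i.e., display the length of w under the swapped names

report_lengths(["alpha", "beta"])   # prints 5, then 4
```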
- Can Large Language Models Capture Dissenting Human Voices? [7.668954669688971]
Large language models (LLMs) have shown impressive achievements in solving a broad range of tasks.
We evaluate LLM performance and the alignment of LLM output distributions with human judgments using two different techniques (a simple distribution-matching score is sketched after this entry).
We show LLMs exhibit limited ability in solving NLI tasks and simultaneously fail to capture human disagreement distribution.
arXiv Detail & Related papers (2023-05-23T07:55:34Z)
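A short sketch of one way to score how well an LLM's label distribution matches human disagreement on a natural language inference item; the probabilities are made up and Jensen-Shannon distance is used here only as one reasonable alignment score, not as the paper's estimation technique.

```python
# Illustrative comparison of an LLM's NLI label distribution with the human
# label distribution for one item. The probabilities are made up; Jensen-Shannon
# distance serves here as one reasonable alignment score.
import numpy as np
from scipy.spatial.distance import jensenshannon

labels = ["entailment", "neutral", "contradiction"]
human = np.array([0.55, 0.35, 0.10])   # humans disagree on this item
llm   = np.array([0.95, 0.04, 0.01])   # the model is confidently single-minded

# 0 = identical distributions; larger values = worse alignment with disagreement.
print(f"JS distance: {jensenshannon(human, llm, base=2):.3f}")
```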
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
- Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety [0.0]
We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result.
We find that relying on average performance to judge capabilities can be highly misleading.
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions.
arXiv Detail & Related papers (2022-12-13T00:29:45Z)
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs [26.118193748582197]
We evaluate four categories of widely used state-of-the-art models.
We find that, despite only evaluating on utterances that require a binary inference, models in three of these categories perform close to random.
These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models.
arXiv Detail & Related papers (2022-10-26T19:04:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.