Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function
- URL: http://arxiv.org/abs/2406.01382v1
- Date: Mon, 3 Jun 2024 14:45:21 GMT
- Title: Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function
- Authors: Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan
- Abstract summary: The diversity of uses that makes large language models (LLMs) impressive also makes them hard to evaluate.
We consider a setting where decisions about where to deploy an LLM are made by people, based on their beliefs about where it will perform well.
We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks.
- Score: 3.7078759896522953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.
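To make the abstract's claim that the human generalization function "can be predicted using NLP methods" concrete, here is a minimal sketch. The data fields, features, and classifier are illustrative assumptions, not the authors' pipeline: given a question a person saw an LLM answer (and whether the LLM got it right), predict whether that person expects the LLM to succeed on a new question.

```python
# Illustrative sketch only: a simple predictor of the human generalization
# function. Field names and features are hypothetical, not the paper's setup.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Each record: (question the person saw the LLM answer, whether the LLM was
# correct on it, new question, whether the person then expected success).
survey = [
    ("What is 12 * 7?",            1, "What is 9 * 8?",               1),
    ("What is 12 * 7?",            0, "What is 9 * 8?",               0),
    ("Who wrote Hamlet?",          1, "Prove Fermat's Last Theorem.", 0),
    ("Name the capital of France", 1, "Name the capital of Peru",     1),
]
seen_q  = [r[0] for r in survey]
seen_ok = np.array([[r[1]] for r in survey], dtype=float)
new_q   = [r[2] for r in survey]
labels  = [r[3] for r in survey]

# Represent the (seen question, new question) pair with shared TF-IDF features
# plus the observed correctness flag, then fit a linear classifier.
vec = TfidfVectorizer().fit(seen_q + new_q)
X = hstack([vec.transform(seen_q), vec.transform(new_q), csr_matrix(seen_ok)])
clf = LogisticRegression().fit(X, labels)

# Predicted probability that a person generalizes "the LLM will get this right".
x_new = hstack([vec.transform(["What is 12 * 7?"]),
                vec.transform(["What is 15 * 6?"]),
                csr_matrix([[1.0]])])
print(clf.predict_proba(x_new)[0, 1])
```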
Related papers
- Large Language Models Assume People are More Rational than We Really are [10.857040292234984]
For AI systems to communicate effectively with people, they must understand how we make decisions.
Previous empirical evidence seems to suggest that the implicit models of human decision-making inside LLMs are accurate.
We find that this is not the case when LLMs both simulate and predict people's choices.
arXiv Detail & Related papers (2024-06-24T18:15:27Z)
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead.
We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z)
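A hedged sketch of the correctness-probe idea from the entry above. The hidden features, probe architecture, and training loop are stand-ins chosen for illustration, not the paper's recipe: label (question, answer) pairs as correct or incorrect, fit a small probe on model features, and read its probability as the confidence estimate.

```python
# Sketch of a correctness probe as an uncertainty estimator. The "hidden
# features" here are random stand-ins for real LLM representations; the probe
# architecture and training loop are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 512, 64
features = torch.randn(n, d)               # stand-in for hidden states of (q, a) pairs
is_correct = (features[:, 0] > 0).float()  # stand-in for graded correctness labels

probe = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):                        # fit the probe on correct/incorrect labels
    opt.zero_grad()
    loss = loss_fn(probe(features).squeeze(-1), is_correct)
    loss.backward()
    opt.step()

# The probe's sigmoid output is the uncertainty estimate for a new answer.
confidence = torch.sigmoid(probe(torch.randn(1, d))).item()
print(f"estimated P(answer is correct) = {confidence:.2f}")
```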
- Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement [75.7148545929689]
Large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others.
We formally define an LLM's self-bias: the tendency to favor its own generations (a toy version of this measurement is sketched after this entry).
We analyze six LLMs on translation, constrained text generation, and mathematical reasoning tasks.
arXiv Detail & Related papers (2024-02-18T03:10:39Z)
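One simple way to operationalize the self-bias defined above, sketched with placeholder numbers rather than the paper's estimator: measure how much a model's self-assigned quality scores exceed an external quality measure, before and after self-refinement.

```python
# Illustrative self-bias measurement: mean gap between a model's self-evaluation
# scores and an external quality metric for the same outputs. Scores are
# placeholder numbers, not results from the paper.
def self_bias(self_scores, external_scores):
    """Average amount by which the model over-rates its own generations."""
    gaps = [s - e for s, e in zip(self_scores, external_scores)]
    return sum(gaps) / len(gaps)

# Before self-refinement vs. after: if self-scores rise faster than external
# quality, the bias is amplified by the self-refinement loop.
before = self_bias(self_scores=[0.70, 0.65, 0.80], external_scores=[0.62, 0.60, 0.75])
after  = self_bias(self_scores=[0.85, 0.82, 0.90], external_scores=[0.63, 0.61, 0.76])
print(f"self-bias before refinement: {before:.3f}, after: {after:.3f}")
```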
- KTO: Model Alignment as Prospect Theoretic Optimization [67.44320255397506]
Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner.
We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases.
We propose a human-aware loss (HALO) that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences; a rough sketch of a loss in this spirit follows this entry.
arXiv Detail & Related papers (2024-02-02T10:53:36Z)
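A rough sketch of a prospect-theory-flavored, KTO-style objective written from the description above; the reference point, weights, and hyperparameters are assumptions, not the paper's exact loss: desirable generations are valued through a saturating function of their implicit reward relative to a reference point, and undesirable ones through its mirror image.

```python
# Illustrative KTO-style loss on implicit rewards r = log pi_theta(y|x) - log pi_ref(y|x).
# The reference point z0 and the loss shape follow the general prospect-theory idea
# described above; the details here are assumptions, not the paper's exact objective.
import torch

def kto_style_loss(logp_policy, logp_ref, desirable, beta=0.1):
    reward = logp_policy - logp_ref                  # implicit reward per example
    z0 = reward.detach().mean().clamp(min=0.0)       # crude reference point (assumed)
    value = torch.where(
        desirable,
        torch.sigmoid(beta * (reward - z0)),         # gains saturate above the reference
        torch.sigmoid(beta * (z0 - reward)),         # losses mirror the gain curve
    )
    return (1.0 - value).mean()                      # push value toward 1 for all examples

# Toy usage with made-up log-probabilities for four generations.
logp_policy = torch.tensor([-12.0, -15.0, -9.0, -20.0], requires_grad=True)
logp_ref    = torch.tensor([-13.0, -14.0, -10.0, -18.0])
desirable   = torch.tensor([True, False, True, False])
loss = kto_style_loss(logp_policy, logp_ref, desirable)
loss.backward()
print(loss.item())
```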
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations [62.61495090463084]
Large language models (LLMs) are trained to imitate humans to explain human decisions.
We evaluate whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals.
We found that LLMs' explanations have low precision and that precision does not correlate with plausibility (a toy precision computation is sketched below).
arXiv Detail & Related papers (2023-07-17T17:41:47Z)
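A small sketch of the precision measurement described above, under an assumed interface: `model`, `simulate`, and the counterfactual inputs below are placeholders for an LLM, a reader of its explanation, and perturbed inputs; precision is the fraction of counterfactuals where the explanation lets the reader predict the model's actual output.

```python
# Illustrative counterfactual-simulatability precision. `model`, `simulate`, and
# the counterfactuals are placeholders standing in for an LLM, a (simulated)
# reader of the explanation, and perturbed inputs.
def precision(model, simulate, explanation, counterfactuals):
    """Fraction of counterfactuals where the explanation lets the simulator
    correctly predict the model's actual output."""
    hits = sum(simulate(explanation, x) == model(x) for x in counterfactuals)
    return hits / len(counterfactuals)

# Toy example: the model answers "yes" only for even numbers, but its
# explanation ("I always answer yes") misleads the simulator.
model = lambda x: "yes" if x % 2 == 0 else "no"
simulate = lambda explanation, x: "yes"          # simulator trusts the explanation
explanation = "I always answer yes."
print(precision(model, simulate, explanation, counterfactuals=[1, 2, 3, 4]))  # 0.5
```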
- The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python [34.13276581200455]
Large Language Models (LLMs) have successfully been applied to code generation tasks.
We show that LLMs fail to properly generate correct Python code when default function names are swapped.
Some of them even become more confident in their incorrect predictions as the model size increases.
arXiv Detail & Related papers (2023-05-24T18:54:39Z)
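To make the failure mode above concrete, here is a toy identifier-swap snippet in the spirit of the paper (an illustration, not an item from the benchmark): the built-ins `len` and `print` are swapped before use, so a completion that treats them as usual produces incorrect code.

```python
# Toy identifier-swap probe in the spirit of the paper (an illustration, not an
# item from the benchmark). After this line, `print` computes lengths and `len`
# writes to stdout.
len, print = print, len

def report_lengths(words):
    # A completion that ignores the swap would call print(...) to display and
    # len(...) to measure; under the swap, the correct usage is reversed.
    for w in words:
        len(print(w))   # i.e., display the length of w under the swapped names

report_lengths(["alpha", "beta"])   # prints 5, then 4
```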
- Can Large Language Models Capture Dissenting Human Voices? [7.668954669688971]
Large language models (LLMs) have shown impressive achievements in solving a broad range of tasks.
We evaluate LLM performance and the alignment of LLM output distributions with human judgments using two different techniques (a simple distribution-matching score is sketched after this entry).
We show LLMs exhibit limited ability in solving NLI tasks and simultaneously fail to capture human disagreement distribution.
arXiv Detail & Related papers (2023-05-23T07:55:34Z)
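A short sketch of one way to score how well an LLM's label distribution matches human disagreement on a natural language inference item; the probabilities are made up and Jensen-Shannon distance is used here only as one reasonable alignment score, not as the paper's estimation technique.

```python
# Illustrative comparison of an LLM's NLI label distribution with the human
# label distribution for one item. The probabilities are made up; Jensen-Shannon
# distance serves here as one reasonable alignment score.
import numpy as np
from scipy.spatial.distance import jensenshannon

labels = ["entailment", "neutral", "contradiction"]
human = np.array([0.55, 0.35, 0.10])   # humans disagree on this item
llm   = np.array([0.95, 0.04, 0.01])   # the model is confidently single-minded

# 0 = identical distributions; larger values = worse alignment with disagreement.
print(f"JS distance: {jensenshannon(human, llm, base=2):.3f}")
```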
- Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
- Despite "super-human" performance, current LLMs are unsuited for decisions about ethics and safety [0.0]
We provide a simple new prompting strategy that leads to yet another supposedly "super-human" result.
We find that relying on average performance to judge capabilities can be highly misleading.
We also observe signs of inverse scaling with model size on some examples, and show that prompting models to "explain their reasoning" often leads to alarming justifications of unethical actions.
arXiv Detail & Related papers (2022-12-13T00:29:45Z)
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs [26.118193748582197]
We evaluate four categories of widely used state-of-the-art models.
We find that, despite only evaluating on utterances that require a binary inference, models in three of these categories perform close to random.
These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models.
arXiv Detail & Related papers (2022-10-26T19:04:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.