Related papers: Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

URL: http://arxiv.org/abs/2409.07638v2
Date: Tue, 24 Sep 2024 17:34:07 GMT
Title: Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities
Authors: Thomas Ball, Shuo Chen, Cormac Herley,
Abstract summary: We present measurements of GPT-4 performance on several deterministic tasks. We find that seemingly trivial modifications in the task-prompt or input population can yield differences far larger than can be explained by sampling effects.
Score: 8.1022073999821
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper we explore evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as input parameter some element drawn from a large well-defined population (e.g., count elements in a list, multiply two k-digit numbers, etc). We examine several conditions per-task and perform enough trials so that statistically significant differences can be detected. This allows us to investigate the sensitivity of task-accuracy both to query phrasing and input parameter population. We find that seemingly trivial modifications in the task-prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query-phrasing and list-length, but also with list composition (i.e., the thing-to-be-counted) and object frequency (e.g., success when an element accounts for $\approx$ 50\% of a list is different from when it accounts for $\approx$ 70\% etc). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, where experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions that have been formed based on interactions with humans form a very unreliable guide as to which input modifications should ``make no difference'' to LLM performance.

Related papers

On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [54.965787768076254]
Large Language Models have been recently exploited as judges for complex natural language processing tasks, such as Q&A.<n>We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z)
Rank It, Then Ask It: Input Reranking for Maximizing the Performance of LLMs on Symmetric Tasks [9.867695275243879]
Large language models (LLMs) have quickly emerged as practical and versatile tools. We consider the application of LLMs on symmetric tasks where a query is asked on an (unordered) bag of elements.
arXiv Detail & Related papers (2024-11-30T17:39:59Z)
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [8.019873464066308]
We introduce two metrics for classification tasks, namely sensitivity and consistency. sensitivity measures changes of predictions across rephrasings of the prompt. Instead, consistency measures how predictions vary across rephrasings for elements of the same class.
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways [3.779027297957693]
We test how prompt design impacts the compliance and accuracy of social science annotations. Our results show that LLM compliance and accuracy are highly prompt-dependent. This work serves as both a warning and practical guide for researchers and practitioners.
arXiv Detail & Related papers (2024-06-17T18:01:43Z)
Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners [38.30539869264287]
Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. We introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources.
arXiv Detail & Related papers (2024-03-19T02:34:18Z)
Not All Layers of LLMs Are Necessary During Inference [68.88671495401483]
We show that for some tasks, Large Language Models can achieve results comparable to the final output at some intermediate layers. We propose a simple yet effective algorithm named AdaInfer to adaptively terminate the inference process for an input instance.
arXiv Detail & Related papers (2024-03-04T16:23:58Z)
Can LLMs perform structured graph reasoning? [4.676784872259775]
Pretrained Large Language Models (LLMs) have demonstrated various reasoning capabilities through language-based prompts alone. We design various graph reasoning tasks as a proxy to semi-structured tasks in this paper. We benchmark 5 different instruct-finetuned LLMs (GPT-4, GPT-3.5, Claude-2, Llama-2 and Palm-2) on the aforementioned tasks.
arXiv Detail & Related papers (2024-02-02T09:45:33Z)
Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting [87.30837365008931]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) prompting are able to make accurate incremental predictions even on unscalable tasks. This study examines the impact of LLMs' step-by-step predictions on gender bias in unscalable tasks.
arXiv Detail & Related papers (2024-01-28T06:50:10Z)
PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z)
Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs) We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We show that these models achieve almost close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers? We propose KaRR, a statistical approach to assess factual knowledge for LLMs. Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.