Related papers: Large Language Models: An Applied Econometric Framework

URL: http://arxiv.org/abs/2412.07031v2
Date: Fri, 03 Jan 2025 14:19:58 GMT
Title: Large Language Models: An Applied Econometric Framework
Authors: Jens Ludwig, Sendhil Mullainathan, Ashesh Rambachan
Abstract summary: We develop an econometric framework to answer this question. Using LLMs for prediction problems is valid under one condition: no "leakage" between the LLM's training dataset and the researcher's sample. We find that these requirements are stringent; when they are violated, the limitations of LLMs now result in unreliable empirical estimates.
Score: 1.348318541691744
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: How can we use the novel capacities of large language models (LLMs) in empirical research? And how can we do so while accounting for their limitations, which are themselves only poorly understood? We develop an econometric framework to answer this question that distinguishes between two types of empirical tasks. Using LLMs for prediction problems (including hypothesis generation) is valid under one condition: no "leakage" between the LLM's training dataset and the researcher's sample. No leakage can be ensured by using open-source LLMs with documented training data and published weights. Using LLM outputs for estimation problems to automate the measurement of some economic concept (expressed either by some text or from human subjects) requires the researcher to collect at least some validation data: without such data, the errors of the LLM's automation cannot be assessed and accounted for. As long as these steps are taken, LLM outputs can be used in empirical research with the familiar econometric guarantees we desire. Using two illustrative applications to finance and political economy, we find that these requirements are stringent; when they are violated, the limitations of LLMs now result in unreliable empirical estimates. Our results suggest the excitement around the empirical uses of LLMs is warranted -- they allow researchers to effectively use even small amounts of language data for both prediction and estimation -- but only with these safeguards in place.
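
Illustration (not from the paper): the role of validation data in estimation problems can be sketched with a simple bias correction. This is a minimal sketch under stated assumptions, not the authors' estimator. It assumes LLM labels exist for every unit in the sample, human labels exist only for a randomly drawn validation subset, and the target is the mean of the true label. The names debiased_mean, llm_labels, human_labels, and validation_idx are hypothetical and chosen for this example only.

```python
from statistics import mean


def debiased_mean(llm_labels, human_labels, validation_idx):
    """Estimate the mean of the true label (illustrative assumption only).

    llm_labels: LLM label for every unit in the sample.
    human_labels: dict mapping a validation unit's index to its human label.
    validation_idx: indices of the randomly drawn validation subset.
    """
    # Naive estimate: treat LLM labels as if they were the truth.
    naive = mean(llm_labels)
    # Average LLM error (human minus LLM), measured on the validation subset.
    errors = [human_labels[i] - llm_labels[i] for i in validation_idx]
    correction = mean(errors)
    # Shift the naive estimate by the measured average error.
    return naive + correction


# Toy data, invented for this example.
llm_labels = [1, 1, 0, 1, 0, 1, 1, 0]
human_labels = {0: 1, 2: 0, 3: 0, 5: 1}
estimate = debiased_mean(llm_labels, human_labels, [0, 2, 3, 5])
print(estimate)
```

On this toy data the naive LLM-only mean is 0.625, while the corrected estimate is 0.375, because the LLM over-labels one of the four validated units. Without the validation subset there is no way to measure that error, which is the point the abstract makes.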

Related papers

Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results [10.858989372235657]
We develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. As a baseline study, we perform two case studies on inter-LLM agreement and the impact of varying the answering and evaluation LLMs.
arXiv Detail & Related papers (2025-11-04T04:20:33Z)
Realizing LLMs' Causal Potential Requires Science-Grounded, Novel Benchmarks [20.409472830397455]
Recent claims of strong performance by Large Language Models (LLMs) on causal discovery are undermined by a key flaw: many evaluations rely on benchmarks likely included in pretraining corpora. We challenge this narrative by asking: Do LLMs truly reason about causal structure, and how can we measure it without memorization concerns? We argue that realizing LLMs' potential for causal analysis requires two shifts: (P.1) developing robust evaluation protocols based on recent scientific studies to guard against dataset leakage, and (P.2) designing hybrid methods that combine LLM-derived knowledge with data-driven statistics.
arXiv Detail & Related papers (2025-10-18T14:58:04Z)
Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z)
An Empirical Study of Many-to-Many Summarization with Large Language Models [82.10000188179168]
Large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform many-to-many summarization (M2MS) in real applications. This work presents a systematic empirical study on LLMs' M2MS ability.
arXiv Detail & Related papers (2025-05-19T11:18:54Z)
Simulating Tabular Datasets through LLMs to Rapidly Explore Hypotheses about Real-World Entities [9.235910374587734]
This paper explores the potential to quickly prototype hypotheses through applying LLMs to estimate properties of concrete entities. The hope is to allow sifting through hypotheses more quickly through collaboration between human and machine.
arXiv Detail & Related papers (2024-11-27T05:48:44Z)
LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? [20.998805709422292]
Test collections are information-retrieval tools that allow researchers to quickly and easily evaluate ranking algorithms. As a cheaper alternative, recent studies have proposed using large language models (LLMs) to completely replace human assessors. We propose LARA, an effective method to balance manual annotations with LLM annotations, helping build a rich and reliable test collection even under a low budget.
arXiv Detail & Related papers (2024-11-11T11:17:35Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z)
Insights from Social Shaping Theory: The Appropriation of Large Language Models in an Undergraduate Programming Course [0.9718746651638346]
Large language models (LLMs) can generate, debug, and explain code. Our study explores how students' social perceptions influence their own LLM usage.
arXiv Detail & Related papers (2024-06-10T16:40:14Z)
$\forall$uto$\exists$val: Autonomous Assessment of LLMs in Formal Synthesis and Interpretation Tasks [21.12437562185667]
This paper presents a new approach for scaling LLM assessment in translating formal syntax to natural language. We use context-free grammars (CFGs) to generate out-of-distribution datasets on the fly. We also conduct an assessment of several SOTA closed and open-source LLMs to showcase the feasibility and scalability of this paradigm.
arXiv Detail & Related papers (2024-03-27T08:08:00Z)
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That? [60.50127555651554]
Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. We introduce a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs.
arXiv Detail & Related papers (2024-03-11T15:48:56Z)
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs). We suggest investigating internal activations and quantifying an LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z)
Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model. We employ a proxy model which has far fewer parameters, and take its answers as heuristic answers. Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs). We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
arXiv Detail & Related papers (2023-11-13T15:08:59Z)
Prevalence and prevention of large language model use in crowd work [11.554258761785512]
We show that the use of large language models (LLMs) is prevalent among crowd workers. We show that targeted mitigation strategies can significantly reduce, but not eliminate, LLM use.
arXiv Detail & Related papers (2023-10-24T09:52:09Z)
Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs). We formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers? We propose KaRR, a statistical approach to assess factual knowledge for LLMs. Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)
Assessing Hidden Risks of LLMs: An Empirical Study on Robustness, Consistency, and Credibility [37.682136465784254]
We conduct over a million queries to the mainstream large language models (LLMs), including ChatGPT, LLaMA, and OPT. We find that ChatGPT is still capable of yielding the correct answer even when the input is polluted at an extreme level. We propose a novel index associated with a dataset that roughly decides the feasibility of using such data for LLM-involved evaluation.
arXiv Detail & Related papers (2023-05-15T15:44:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.