Variance-Aware LLM Annotation for Strategy Research: Sources, Diagnostics, and a Protocol for Reliable Measurement
- URL: http://arxiv.org/abs/2601.02370v3
- Date: Mon, 19 Jan 2026 11:23:39 GMT
- Title: Variance-Aware LLM Annotation for Strategy Research: Sources, Diagnostics, and a Protocol for Reliable Measurement
- Authors: Arnaldo Camuffo, Alfonso Gambardella, Saeid Kazemi, Jakub Malachowski, Abhinav Pandey,
- Abstract summary: Large language models (LLMs) offer strategy researchers powerful tools for annotating text at scale. But treating LLM-generated labels as deterministic overlooks substantial instability. We diagnose five variance sources: construct specification, interface effects, model preferences, output extraction, and system-level aggregation.
- Score: 0.3228822469249803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) offer strategy researchers powerful tools for annotating text at scale, but treating LLM-generated labels as deterministic overlooks substantial instability. Grounded in content analysis and generalizability theory, we diagnose five variance sources: construct specification, interface effects, model preferences, output extraction, and system-level aggregation. Empirical demonstrations show that minor design choices (prompt phrasing, model selection) can shift outcomes by 12-85 percentage points. Such variance threatens not only reproducibility but econometric identification: annotation errors correlated with covariates bias parameter estimates regardless of average accuracy. We develop a variance-aware protocol specifying sampling budgets, aggregation rules, and reporting standards, and delineate scope conditions where LLM annotation should not be used. These contributions transform LLM-based annotation from ad hoc practice into auditable measurement infrastructure.
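The protocol itself is not reproduced on this page, but a minimal sketch of the core idea, repeated sampling per item with majority-vote aggregation and a per-item stability diagnostic, might look as follows. The `annotate` function is a hypothetical placeholder for any real chat-completion call, stubbed so the sketch runs without API access.

```python
import random
from collections import Counter

def annotate(text: str, prompt: str, seed: int) -> str:
    # Hypothetical stand-in for an LLM call; a seeded random draw
    # keeps the sketch runnable without any API access.
    random.seed(hash((text, prompt, seed)) % (2**32))
    return random.choice(["relevant", "not_relevant"])

def variance_aware_label(text: str, prompts: list[str], k: int = 5):
    # Sample k annotations per prompt variant (the sampling budget),
    # aggregate by majority vote, and report the vote share so that
    # unstable items can be flagged and audited.
    votes = [annotate(text, p, seed=i) for p in prompts for i in range(k)]
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)  # vote share < 1.0 signals instability

label, stability = variance_aware_label(
    "We moved from one-off licenses to subscriptions.",
    prompts=["Does this statement describe a pricing strategy change? Answer relevant/not_relevant.",
             "Classify: is this about pricing strategy? Answer relevant or not_relevant."],
)
print(label, f"stability={stability:.2f}")
```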
Related papers
- VIOLA: Towards Video In-Context Learning with Minimal Annotations [20.810620293371027]
We introduce VIOLA, a framework that synergizes minimal expert supervision with abundant unlabeled data.
Our framework significantly outperforms various baselines in low-resource settings, achieving robust adaptation with minimal annotation costs.
arXiv Detail & Related papers (2026-01-22T00:35:30Z)
- Enhancing LLM-Based Data Annotation with Error Decomposition [6.6544828402388445]
Large language models offer a scalable alternative to human coding for data annotation tasks.
Their performance on subjective annotation tasks is less consistent and more prone to errors.
We propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies.
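The paper's exact human-in-the-loop procedure is not given in this summary; one plausible reading, separating items where humans themselves disagree (task-inherent ambiguity) from items where humans agree but the model errs, can be sketched as:

```python
import numpy as np

# Toy data: several human labels per item plus one model label (binary codes).
human_labels = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 0], [0, 0, 0]]
model_labels = [1, 1, 0, 0, 1]

ambiguous, model_driven = [], []
for i, (h, m) in enumerate(zip(human_labels, model_labels)):
    if len(set(h)) > 1:              # humans disagree: inherent ambiguity
        ambiguous.append(i)
    elif m != round(np.mean(h)):     # humans agree, model is wrong
        model_driven.append(i)

print(f"ambiguous items: {ambiguous}, model-driven errors: {model_driven}")
```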
arXiv Detail & Related papers (2026-01-17T05:43:17Z)
- BiasLab: A Multilingual, Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models [3.643198597030366]
This paper introduces BiasLab, an open-source, model-agnostic evaluation framework for quantifying output-level (extrinsic) bias.
The framework supports evaluation across diverse bias axes, including demographic, cultural, political, and geopolitical topics.
arXiv Detail & Related papers (2026-01-11T11:07:46Z)
- Simple Yet Effective: An Information-Theoretic Approach to Multi-LLM Uncertainty Quantification [9.397157329808254]
MUSE is a simple information-theoretic method to identify and aggregate well-calibrated subsets of large language models.
Experiments on binary prediction tasks demonstrate improved calibration and predictive performance compared to single-model and naïve ensemble baselines.
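MUSE's exact selection criterion is not stated in this summary; a rough illustration of the general recipe, score each model with an information-theoretic proper score on held-out data and average the probabilities of the models that beat an uninformative baseline, could look like this (all data below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
# Hypothetical validation probabilities from three models.
models = {
    "m1": np.clip(labels + rng.normal(0.0, 0.25, 500), 0.01, 0.99),
    "m2": np.clip(labels + rng.normal(0.0, 0.45, 500), 0.01, 0.99),
    "m3": np.full(500, 0.5),  # uninformative: log loss exactly log 2
}

def log_loss(p, y):
    # Negative log-likelihood, an information-theoretic proper score:
    # low values require predictions that are calibrated AND sharp.
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Keep models whose score beats the uninformative baseline, then average.
scores = {k: log_loss(p, labels) for k, p in models.items()}
subset = [k for k, s in scores.items() if s < np.log(2)]
ensemble = np.mean([models[k] for k in subset], axis=0)
print(subset, ensemble[:3].round(2))
```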
arXiv Detail & Related papers (2025-07-09T19:13:25Z)
- Statistical Hypothesis Testing for Auditing Robustness in Language Models [49.1574468325115]
We introduce distribution-based perturbation analysis, a framework that reformulates perturbation analysis as a frequentist hypothesis testing problem.
We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling.
We show how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models.
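In outline, and with a stubbed embedder standing in for a real semantic model, the test could be sketched as a permutation test on embedded outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    # Hypothetical embedder; random vectors keep the sketch runnable.
    return rng.normal(size=(len(texts), 16))

def shift(a, b):
    # Test statistic: distance between the mean embeddings of two samples.
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

# Null sample: outputs resampled from the original prompt.
# Alternative sample: outputs from the perturbed prompt.
null_out = embed([f"output {i}" for i in range(50)])
alt_out = embed([f"perturbed {i}" for i in range(50)])

observed = shift(null_out, alt_out)
pooled = np.vstack([null_out, alt_out])
perm = []
for _ in range(2000):
    idx = rng.permutation(len(pooled))
    perm.append(shift(pooled[idx[:50]], pooled[idx[50:]]))
p_value = (1 + np.sum(np.array(perm) >= observed)) / (1 + len(perm))
print(f"p = {p_value:.3f}")  # small p: the perturbation changed behavior
```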
arXiv Detail & Related papers (2025-06-09T17:11:07Z)
- To Err Is Human; To Annotate, SILICON? Reducing Measurement Error in LLM Annotation [11.470318058523466]
Large Language Models (LLMs) promise a cost-effective, scalable alternative to human annotation.
We develop the SILICON methodology to systematically reduce measurement error from LLM annotation.
Our evidence indicates that reducing each error source is necessary, and that SILICON supports rigorous annotation in management research.
arXiv Detail & Related papers (2024-12-19T02:21:41Z)
- Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG).
We compute a factuality score that can be thresholded to yield a binary decision.
Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
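As a shape for the decision rule only: the scoring stub below is hypothetical, a token-overlap stand-in for a real entailment scorer over the retrieved context.

```python
def entail_prob(context: str, claim: str) -> float:
    # Hypothetical stand-in for an NLI/cross-encoder entailment score
    # between retrieved context and a generated claim.
    claim_words = set(claim.lower().replace(".", "").split())
    context_words = set(context.lower().replace(".", "").split())
    return len(claim_words & context_words) / max(len(claim_words), 1)

def is_factual(context: str, claim: str, threshold: float = 0.8) -> bool:
    # Threshold the factuality score to a binary accept/flag decision.
    return entail_prob(context, claim) >= threshold

ctx = "The report was published in 2019 by the WHO."
print(is_factual(ctx, "The report was published in 2019."))  # True
print(is_factual(ctx, "The report was retracted in 2021."))  # False
```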
arXiv Detail & Related papers (2024-11-01T20:44:59Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
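One simple proxy for that idea, sample several explanation-answer pairs and read confidence off the induced answer distribution, can be sketched as follows; the paper's method works with the explanation distribution itself, while this sketch only uses the answers it induces (samples are hypothetical).

```python
import math
from collections import Counter

# Hypothetical (answer, explanation) pairs sampled from one prompt.
samples = [("A", "because of x"), ("A", "because of x"), ("A", "since x holds"),
           ("B", "because of y"), ("A", "because of x")]

counts = Counter(answer for answer, _ in samples)
total = len(samples)
confidence = counts.most_common(1)[0][1] / total   # modal answer share
entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
print(f"confidence={confidence:.2f}, answer entropy={entropy:.2f}")
```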
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Regression-aware Inference with LLMs [52.764328080398805]
We show that the usual inference strategy can be sub-optimal for common regression and scoring evaluation metrics.
We propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed-form from sampled responses.
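For squared and absolute error, for example, the closed-form plug-in estimates from sampled responses are just the sample mean and the sample median; a minimal sketch:

```python
import numpy as np

# Hypothetical numeric scores parsed from k sampled LLM responses to one item.
samples = np.array([3.0, 4.0, 4.0, 5.0, 4.0, 3.0, 4.0, 4.0])

pred_for_mse = samples.mean()       # Bayes-optimal under squared error
pred_for_mae = np.median(samples)   # Bayes-optimal under absolute error
print(pred_for_mse, pred_for_mae)   # 3.875 4.0
```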
arXiv Detail & Related papers (2024-03-07T03:24:34Z)
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
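A toy decomposition along those lines; the predictive distributions below are hypothetical stand-ins for LLM outputs under different clarifications of one ambiguous input.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# One predictive distribution over 3 labels per clarification of the input.
per_clarification = np.array([
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.70, 0.20, 0.10],
])

ensemble = per_clarification.mean(axis=0)   # ensembled prediction
total = entropy(ensemble)                   # total predictive uncertainty
within = np.mean([entropy(p) for p in per_clarification])
ambiguity = total - within  # uncertainty attributable to the unclear input
print(ensemble.round(2), round(total, 3), round(within, 3), round(ambiguity, 3))
```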
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
- Model-based causal feature selection for general response types [8.228587135343071]
Invariant causal prediction (ICP) is a method for causal feature selection which requires data from heterogeneous settings.
We develop transformation-model (TRAM) based ICP, allowing for continuous, categorical, count-type, and uninformatively censored responses.
We provide an open-source R package 'tramicp' and evaluate our approach on simulated data and in a case study investigating causal features of survival in critically ill patients.
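The package itself is in R; a language-agnostic caricature of the ICP idea, accept feature sets whose regression residuals look identically distributed across environments and intersect the accepted sets, can be written as:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy data from two environments: X0 causes y, X1 is a non-causal
# covariate whose distribution shifts with the environment.
def simulate(shift, n=300):
    x0 = rng.normal(loc=shift, size=n)
    x1 = rng.normal(loc=-shift, size=n)
    y = 2.0 * x0 + rng.normal(size=n)
    return np.column_stack([x0, x1]), y

(Xa, ya), (Xb, yb) = simulate(0.0), simulate(2.0)
X, y = np.vstack([Xa, Xb]), np.concatenate([ya, yb])
env = np.array([0] * len(ya) + [1] * len(yb))

accepted = []
for S in [(), (0,), (1,), (0, 1)]:
    A = np.column_stack([np.ones(len(y)), X[:, list(S)]])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    # Invariance check: residuals identically distributed across envs.
    p = stats.ks_2samp(resid[env == 0], resid[env == 1]).pvalue
    if p > 0.05:
        accepted.append(set(S))

# ICP output: features in every accepted set are the inferred causal parents.
parents = set.intersection(*accepted) if accepted else set()
print(accepted, parents)
```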
arXiv Detail & Related papers (2023-09-22T12:42:48Z)
- Conformal Language Modeling [61.94417935386489]
We propose a novel approach to conformal prediction for generative language models (LMs).
Standard conformal prediction produces prediction sets with rigorous statistical guarantees.
We demonstrate the promise of our approach on multiple tasks in open-domain question answering, text summarization, and radiology report generation.
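The paper's procedure is tailored to sampled generations; the generic split-conformal calibration underneath it looks like this (the confidence scores below are synthetic stand-ins for model-derived confidences):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, alpha = 200, 0.1

# Calibration step: nonconformity = 1 - confidence assigned to the true
# output on held-out examples (synthetic here).
nonconformity = 1.0 - rng.beta(8, 2, size=n_cal)

# Finite-sample-corrected conformal quantile.
q_level = min(1.0, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)
qhat = np.quantile(nonconformity, q_level, method="higher")

# Test step: keep every candidate generation whose nonconformity <= qhat;
# the set then contains an acceptable output with probability >= 1 - alpha.
candidate_conf = np.array([0.93, 0.41, 0.88, 0.15])
prediction_set = np.where(1.0 - candidate_conf <= qhat)[0]
print(round(float(qhat), 3), prediction_set)
```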
arXiv Detail & Related papers (2023-06-16T21:55:08Z)
- Meta-Causal Feature Learning for Out-of-Distribution Generalization [71.38239243414091]
This paper presents a balanced meta-causal learner (BMCL), which includes a balanced task generation module (BTG) and a meta-causal feature learning module (MCFL).
BMCL effectively identifies the class-invariant visual regions for classification and may serve as a general framework to improve the performance of state-of-the-art methods.
arXiv Detail & Related papers (2022-08-22T09:07:02Z)