Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks
- URL: http://arxiv.org/abs/2510.06658v2
- Date: Wed, 29 Oct 2025 03:24:46 GMT
- Title: Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks
- Authors: Jiaman He, Zikang Leng, Dana McKay, Damiano Spina, Johanne R. Trippas,
- Abstract summary: We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. We develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former.
- Score: 8.246529401043128
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated "ground truth" using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach LLMs as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. This evaluation method tests whether an LLM can blend into a group of human annotators without being distinguishable. We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former ($p = 0.004$), but not in the latter ($p = 0.155$), highlighting task-dependent differences. The method also enables early evaluation on a small sample of human data to inform whether LLMs are suitable for large-scale annotation in a given application.
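
For readers who want to prototype this kind of check, the sketch below illustrates one plausible reading of the pipeline described in the abstract: compute Krippendorff's alpha for an all-human annotator panel and for a panel in which one human is replaced by the LLM, form paired bootstrap estimates of the alpha difference by resampling items, and run a TOST equivalence test on those differences. This is not the authors' code; the `krippendorff` package, the equivalence margin, the panel sizes, and all function names are assumptions made for illustration.

```python
# Minimal sketch (not the authors' implementation) of an equivalence check:
# Krippendorff's alpha + paired bootstrap + TOST. The margin, sample sizes,
# and function names are illustrative assumptions.
import numpy as np
from scipy import stats
import krippendorff  # PyPI "krippendorff" package, assumed available


def alpha_nominal(labels):
    """Krippendorff's alpha for an (annotators x items) label matrix, nominal data."""
    return krippendorff.alpha(reliability_data=labels, level_of_measurement="nominal")


def paired_bootstrap_alpha_diff(panel_a, panel_b, n_boot=2000, seed=0):
    """Resample items (columns) with replacement and return bootstrap estimates of
    alpha(panel_a) - alpha(panel_b), using the same resampled items for both panels."""
    rng = np.random.default_rng(seed)
    n_items = panel_a.shape[1]
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_items, size=n_items)  # paired resampling of items
        diffs[b] = alpha_nominal(panel_a[:, idx]) - alpha_nominal(panel_b[:, idx])
    return diffs


def tost_pvalue(diffs, margin=0.05):
    """Two One-Sided Tests: evidence that the mean difference lies inside
    (-margin, +margin). Returns the max of the two one-sided p-values."""
    p_lower = stats.ttest_1samp(diffs, -margin, alternative="greater").pvalue
    p_upper = stats.ttest_1samp(diffs, margin, alternative="less").pvalue
    return max(p_lower, p_upper)


# Toy example: 4 human annotators labeling 100 items with 3 categories, versus
# the same panel with one human replaced by a simulated "LLM" that flips 10% of
# that annotator's labels.
rng = np.random.default_rng(1)
base = rng.integers(0, 3, size=100)
humans = np.stack(
    [np.where(rng.random(100) < 0.8, base, rng.integers(0, 3, size=100)) for _ in range(4)]
).astype(float)
with_llm = humans.copy()
flip = rng.random(100) < 0.1
with_llm[3, flip] = rng.integers(0, 3, size=flip.sum())

diffs = paired_bootstrap_alpha_diff(humans, with_llm)
print("TOST p-value:", tost_pvalue(diffs))  # small p-value -> equivalent within the margin
```

Under the usual TOST convention, a p-value below the significance level supports equivalence within the chosen margin. The margin is a judgment call that drives the conclusion; the abstract does not state the one used in the paper, so the 0.05 above is purely illustrative.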
Related papers
- Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation [89.52571224447111]
Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization. We provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization.
arXiv Detail & Related papers (2026-02-07T19:39:28Z) - Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection [5.731621080995591]
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement.
arXiv Detail & Related papers (2025-12-10T14:00:48Z) - Just Put a Human in the Loop? Investigating LLM-Assisted Annotation for Subjective Tasks [18.695435335031355]
In subjective annotation tasks with multiple plausible answers, reviewing LLM outputs can change the label distribution. We conducted a pre-registered experiment with 410 unique annotators and over 7,000 annotations. We find that presenting crowdworkers with LLM-generated annotation suggestions did not make them faster, but did improve their self-reported confidence in the task.
arXiv Detail & Related papers (2025-07-21T17:29:21Z) - Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering [18.766132076075365]
Large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation. The Pass@k metric requires extensive unit tests and configured environments, and is not suitable for evaluating LLM-generated text. Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny.
arXiv Detail & Related papers (2025-02-10T06:49:29Z) - The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs [21.97227334180969]
"LLM-as-an-annotator" and "LLM-as-a-judge" paradigms employ Large Language Models (LLMs) as annotators, judges, and evaluators in tasks traditionally performed by humans.<n>Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators.<n>We propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations.
arXiv Detail & Related papers (2025-01-19T07:09:11Z) - Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors. We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of large language model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Likelihood-based Mitigation of Evaluation Bias in Large Language Models [29.953609131069523]
Large Language Models (LLMs) are widely used to evaluate natural language generation tasks as automated metrics. A likelihood bias may arise when LLMs are used for evaluation.
arXiv Detail & Related papers (2024-02-25T04:52:02Z) - PiCO: Peer Review in LLMs based on the Consistency Optimization [48.48819141999387]
We use peer-review mechanisms to measure large language models (LLMs) automatically. We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores. We propose three metrics called PEN, CIN, and LIS to evaluate the gap in aligning human rankings.
arXiv Detail & Related papers (2024-02-02T18:49:26Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs). We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric. Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z)