Fugu-MT Paper Translation (Abstract): Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks

Paper overview: Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks

arxiv url: http://arxiv.org/abs/2510.06658v1
Date: Wed, 08 Oct 2025 05:17:33 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.310458
Title: Can We Hide Machines in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation Tasks
Title (reference translation): Can Machines Be Hidden in the Crowd? Quantifying Equivalence in LLM-in-the-loop Annotation
Authors: Jiaman He, Zikang Leng, Dana McKay, Damiano Spina, Johanne R. Trippas
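Abstract summary: We aim to explore how labeling decisions by both humans and LLMs can be statistically evaluated across individuals. We develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. Applying this approach to two datasets, MovieLens 100K and PolitiFact, we find that the LLM is statistically indistinguishable from a human annotator in the former.
Reference score (independently computed attention score): 8.246529401043128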
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels to human-annotated "ground truth" using standard performance metrics. In contrast, our study moves beyond effectiveness alone. We aim to explore how labeling decisions -- by both humans and LLMs -- can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach LLMs as an alternative annotation mechanism that may be capable of mimicking the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. This evaluation method tests whether an LLM can blend into a group of human annotators without being distinguishable. We apply this approach to two datasets -- MovieLens 100K and PolitiFact -- and find that the LLM is statistically indistinguishable from a human annotator in the former ($p = 0.004$), but not in the latter ($p = 0.155$), highlighting task-dependent differences. The method also enables early evaluation on a small sample of human data to inform whether LLMs are suitable for large-scale annotation in a given application.
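Abstract (reference translation): Many evaluations of large language models (LLMs) in text annotation focus primarily on the correctness of the output, typically comparing model-generated labels with human-annotated "ground truth" using standard performance metrics. In contrast, our study goes beyond effectiveness alone. We aim to explore how labeling decisions by both humans and LLMs can be statistically evaluated across individuals. Rather than treating LLMs purely as annotation systems, we approach them as an alternative annotation mechanism that may be able to mimic the subjective judgments made by humans. To assess this, we develop a statistical evaluation method based on Krippendorff's $\alpha$, paired bootstrapping, and the Two One-Sided t-Tests (TOST) equivalence test procedure. This method tests whether an LLM can blend into a group of human annotators without being distinguishable. Applying the approach to two datasets, MovieLens 100K and PolitiFact, we find that the LLM is statistically indistinguishable from a human annotator in the former ($p = 0.004$) but not in the latter ($p = 0.155$), which highlights task-dependent differences. The method also enables early evaluation on a small sample of human data to indicate whether LLMs are suitable for large-scale annotation in a given application.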
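The abstract names three ingredients (Krippendorff's $\alpha$, paired bootstrapping, and an equivalence test) but does not spell out how they are combined. The following is a minimal sketch of one plausible combination, not the authors' procedure. It assumes nominal labels with no missing values, measures the change in $\alpha$ when the LLM replaces one human annotator (averaged over which human is replaced), and swaps the t-test-based TOST for a percentile-bootstrap analogue. The function names, the equivalence margin `delta`, and the resample count are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def nominal_alpha(ratings):
    """Krippendorff's alpha for nominal labels with no missing values.

    ratings: list of units, each a list holding one label per annotator
    (at least two annotators per unit).
    """
    m = len(ratings[0])                 # annotators per unit
    totals = Counter()                  # label counts over all ratings
    observed = 0.0                      # weighted count of disagreeing pairs
    for unit in ratings:
        counts = Counter(unit)
        totals.update(counts)
        # ordered pairs with differing labels in this unit, weighted by 1/(m-1)
        observed += (m * m - sum(c * c for c in counts.values())) / (m - 1)
    n = sum(totals.values())
    expected = n * n - sum(c * c for c in totals.values())
    if expected == 0:
        # Only one label occurs, so alpha is undefined; returning 1.0 is a
        # convention assumed here for illustration.
        return 1.0
    return 1.0 - (n - 1) * observed / expected


def alpha_shift(humans, llm, idx):
    """Mean alpha with the LLM swapped in for one human, minus human-only alpha."""
    rows = [list(humans[i]) for i in idx]
    base = nominal_alpha(rows)
    m = len(rows[0])
    mixed = []
    for j in range(m):
        swapped = [row[:j] + [llm[i]] + row[j + 1:] for i, row in zip(idx, rows)]
        mixed.append(nominal_alpha(swapped))
    return sum(mixed) / m - base


def bootstrap_equivalence(humans, llm, delta=0.1, n_boot=2000, seed=0):
    """Bootstrap analogue of TOST: p-value for |alpha shift| < delta."""
    rng = random.Random(seed)
    n_units = len(humans)
    diffs = []
    for _ in range(n_boot):
        # Paired resampling: the same units are drawn for both conditions.
        idx = [rng.randrange(n_units) for _ in range(n_units)]
        diffs.append(alpha_shift(humans, llm, idx))
    p_lower = sum(d <= -delta for d in diffs) / n_boot  # H0: shift <= -delta
    p_upper = sum(d >= delta for d in diffs) / n_boot   # H0: shift >= +delta
    return max(p_lower, p_upper)
```

Calling `bootstrap_equivalence(humans, llm)` with `humans` as a list of per-item label lists and `llm` as a list of the model's labels returns the larger of the two one-sided p-values. Under this sketch, a value below the chosen significance level would suggest that adding the LLM shifts agreement by less than `delta` in either direction. A t-test-based TOST on the bootstrap differences, as the abstract names, would need a Student's t distribution from a library such as SciPy or statsmodels.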


Related papers