Can Unconfident LLM Annotations Be Used for Confident Conclusions?
- URL: http://arxiv.org/abs/2408.15204v1
- Date: Tue, 27 Aug 2024 17:03:18 GMT
- Title: Can Unconfident LLM Annotations Be Used for Confident Conclusions?
- Authors: Kristina Gligorić, Tijana Zrnic, Cinoo Lee, Emmanuel J. Candès, Dan Jurafsky
- Abstract summary: Large language models (LLMs) have shown high agreement with human raters across a variety of tasks.
We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected.
- Score: 34.23823544208315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown high agreement with human raters across a variety of tasks, demonstrating potential to ease the challenges of human data collection. In computational social science (CSS), researchers are increasingly leveraging LLM annotations to complement slow and expensive human annotations. Still, guidelines for collecting and using LLM annotations, without compromising the validity of downstream conclusions, remain limited. We introduce Confidence-Driven Inference: a method that combines LLM annotations and LLM confidence indicators to strategically select which human annotations should be collected, with the goal of producing accurate statistical estimates and provably valid confidence intervals while reducing the number of human annotations needed. Our approach comes with safeguards against LLM annotations of poor quality, guaranteeing that the conclusions will be both valid and no less accurate than if we only relied on human annotations. We demonstrate the effectiveness of Confidence-Driven Inference over baselines in statistical estimation tasks across three CSS settings--text politeness, stance, and bias--reducing the needed number of human annotations by over 25% in each. Although we use CSS settings for demonstration, Confidence-Driven Inference can be used to estimate most standard quantities across a broad range of NLP problems.
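To make the idea concrete, below is a minimal sketch, not the paper's exact procedure: human labels are requested preferentially where the LLM is unconfident, and the LLM-only estimate of a mean is corrected with inverse-probability weights on the human-labeled items. The `human_label_fn` callback, the specific sampling rule, and the normal-approximation interval are illustrative assumptions rather than the authors' exact choices.

```python
import numpy as np
from statistics import NormalDist

def confidence_driven_estimate(llm_labels, llm_confidence, human_label_fn,
                               budget, alpha=0.05, seed=None):
    """Estimate a population mean from LLM annotations, spending a limited
    human-annotation budget preferentially on low-confidence items and
    debiasing with inverse-probability weights (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    llm = np.asarray(llm_labels, dtype=float)
    conf = np.asarray(llm_confidence, dtype=float)
    n = len(llm)

    # Collection probabilities: more human effort where the LLM is unsure,
    # floored away from zero so the inverse-probability weights stay bounded.
    raw = (1.0 - conf) + 1e-12
    probs = np.clip(budget * raw / raw.sum(), 0.05, 1.0)

    collect = rng.random(n) < probs                    # items sent to human annotators
    human = np.zeros(n)
    human[collect] = [human_label_fn(i) for i in np.flatnonzero(collect)]

    # Debiased estimate: LLM-only mean plus a weighted correction on the
    # human-labeled items (unbiased because E[collect_i] / probs_i = 1).
    correction = np.where(collect, (human - llm) / probs, 0.0)
    point = np.mean(llm + correction)

    # Normal-approximation confidence interval (a simplification; the paper
    # derives provably valid intervals for its estimator).
    se = np.std(llm + correction, ddof=1) / np.sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return point, (point - z * se, point + z * se)

# Toy usage with simulated data: the LLM is more accurate when it is confident.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=2000)
conf = rng.uniform(0.5, 1.0, size=2000)
llm = np.where(rng.random(2000) < conf, truth, 1 - truth)
estimate, ci = confidence_driven_estimate(llm, conf, lambda i: truth[i],
                                           budget=400, seed=1)
print(f"estimate={estimate:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

In this toy setup only a few hundred of the 2,000 items receive (simulated) human labels, yet the weighted correction keeps the point estimate centered on the true label mean; the paper's contribution is a principled version of this selection and interval construction with formal validity guarantees.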
Related papers
- Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language [0.0]
This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language.
We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and large language model (LLM) annotations.
arXiv Detail & Related papers (2024-10-17T08:10:24Z)
- Advancing Annotation of Stance in Social Media Posts: A Comparative Analysis of Large Language Models and Crowd Sourcing [2.936331223824117]
The use of Large Language Models (LLMs) for automated text annotation of social media posts has garnered significant interest.
We analyze the performance of eight open-source and proprietary LLMs for annotating the stance expressed in social media posts.
A significant finding of our study is that the explicitness of the text expressing a stance plays a critical role in how faithfully LLMs' stance judgments match humans'.
arXiv Detail & Related papers (2024-06-11T17:26:07Z)
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating how well large language models (LLMs) identify and clarify ambiguous information needs.
Building upon its taxonomy, we construct 12K high-quality examples to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large
Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows that CoAnnotating is an effective means of allocating work: results on different datasets show up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z) - Using Large Language Models for Qualitative Analysis can Introduce
Serious Bias [0.09208007322096534]
Large Language Models (LLMs) are quickly becoming ubiquitous, but the implications for social science research are not yet well understood.
This paper asks whether LLMs can help us analyse large-N qualitative data from open-ended interviews, with an application to transcripts of interviews with Rohingya refugees in Cox's Bazaar, Bangladesh.
We find that a great deal of caution is needed in using LLMs to annotate text as there is a risk of introducing biases that can lead to misleading inferences.
arXiv Detail & Related papers (2023-09-29T11:19:15Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate the faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency (a minimal sketch of these three components appears after this list).
arXiv Detail & Related papers (2023-06-22T17:31:44Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find that instruction tuning, not model size, is the key to LLMs' zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
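As a companion to the black-box confidence elicitation framework summarized above (prompting for verbalized confidence, sampling multiple responses, aggregating consistency), here is a minimal sketch. It assumes a hypothetical `ask_llm(prompt) -> str` callable; the prompt wording, response format, and the way verbalized confidence is combined with cross-sample consistency are illustrative choices, not the paper's exact recipe.

```python
import re
from collections import Counter

def elicit_confidence(ask_llm, question, k=5):
    """Black-box confidence elicitation sketch: prompt for a verbalized
    confidence, sample k responses, and aggregate answer consistency.
    `ask_llm` is a hypothetical callable mapping a prompt string to the
    model's text response."""
    prompt = (f"{question}\n"
              "Reply in the form 'Answer: <answer>; Confidence: <0-100>'.")
    answers, confidences = [], []
    for _ in range(k):
        reply = ask_llm(prompt)
        m = re.search(r"Answer:\s*(.+?);\s*Confidence:\s*(\d+)", reply)
        if not m:
            continue  # skip unparsable samples
        answers.append(m.group(1).strip().lower())
        confidences.append(int(m.group(2)) / 100.0)
    if not answers:
        return None, 0.0
    top, count = Counter(answers).most_common(1)[0]
    consistency = count / len(answers)  # agreement rate across samples
    verbalized = sum(c for a, c in zip(answers, confidences) if a == top) / count
    # Combine verbalized confidence with cross-sample consistency.
    return top, consistency * verbalized

# Example (hypothetical model wrapper):
# answer, score = elicit_confidence(my_model_call, "Is this tweet polite?")
```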