Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications
- URL: http://arxiv.org/abs/2507.00543v1
- Date: Tue, 01 Jul 2025 08:04:58 GMT
- Title: Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications
- Authors: Leila Tavakoli, Hamed Zamani
- Abstract summary: This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset. We show that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. We propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review.
- Score: 21.698669254520475
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the search clarification task, leveraging a high-quality, multi-dimensional dataset that includes five distinct fine-grained annotation subtasks. Although LLMs have shown impressive capabilities in general settings, our study reveals that even state-of-the-art models struggle to replicate human-level performance in subjective or fine-grained evaluation tasks. Through a systematic assessment, we demonstrate that LLM predictions are often inconsistent, poorly calibrated, and highly sensitive to prompt variations. To address these limitations, we propose a simple yet effective human-in-the-loop (HITL) workflow that uses confidence thresholds and inter-model disagreement to selectively involve human review. Our findings show that this lightweight intervention significantly improves annotation reliability while reducing human effort by up to 45%, offering a relatively scalable and cost-effective yet accurate path forward for deploying LLMs in real-world evaluation settings.
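The routing rule described in the abstract (accept an annotation automatically only when the models are confident and agree, and escalate it to a human reviewer otherwise) can be sketched in a few lines. The snippet below is an illustrative approximation rather than the paper's implementation: the names `ModelVote` and `route_item`, the 0.8 threshold, and the assumption of one self-reported confidence per model are all hypothetical.

```python
# Minimal sketch of confidence-threshold + inter-model-disagreement routing.
# An item is auto-accepted only when every model is confident AND all models
# agree on the label; otherwise it is escalated to a human reviewer.

from dataclasses import dataclass
from typing import List


@dataclass
class ModelVote:
    label: str         # label predicted by one LLM annotator
    confidence: float  # that model's self-reported confidence in [0, 1]


def route_item(votes: List[ModelVote], confidence_threshold: float = 0.8) -> str:
    """Return "auto" when all models agree and each exceeds the threshold,
    otherwise "human", so only ambiguous or low-confidence items cost human effort."""
    labels = {v.label for v in votes}
    all_confident = all(v.confidence >= confidence_threshold for v in votes)
    return "auto" if len(labels) == 1 and all_confident else "human"


if __name__ == "__main__":
    # Agreement with high confidence -> accepted automatically.
    print(route_item([ModelVote("clarifying", 0.93), ModelVote("clarifying", 0.88)]))
    # Disagreement -> escalated for human review.
    print(route_item([ModelVote("clarifying", 0.91), ModelVote("off_topic", 0.84)]))
```

Tightening the threshold trades human effort for reliability; the paper reports that this kind of selective review improves annotation reliability while reducing human effort by up to 45%.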
Related papers
- Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners. In this work, we introduce a new task paradigm: proactive information gathering. We design a scalable framework that generates partially specified, real-world tasks, masking key information. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z) - Real-World Summarization: When Evaluation Reaches Its Limits [1.4197924572122094]
We compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments. Our analysis of real-world business impacts shows that incorrect and non-checkable information poses the greatest risks.
arXiv Detail & Related papers (2025-07-15T17:23:56Z) - Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth [21.672923905771576]
The use of large language models (LLMs) by crowdsourcing workers poses a challenge to datasets intended to reflect human input. We propose a training-free scoring mechanism with theoretical guarantees under a crowdsourcing model that accounts for LLM collusion.
arXiv Detail & Related papers (2025-06-08T04:38:39Z) - Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing the critique abilities of Large Language Models (LLMs). We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities. Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - Your Weak LLM is Secretly a Strong Teacher for Alignment [19.33906256866585]
Existing alignment frameworks present constraints either in the form of expensive human effort or high computational costs. This paper explores a promising middle ground, where we employ a weak LLM that is significantly less resource-intensive than top-tier models. We show that weak LLMs can provide feedback that rivals or even exceeds that of fully human-annotated data.
arXiv Detail & Related papers (2024-09-13T13:24:52Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges -- the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z) - An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment [9.156064716689833]
This study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. Results show that GPT-4 generally generates fewer erroneous simplification outputs compared to the current state-of-the-art. However, LLMs have their limitations, as seen in GPT-4's and Qwen2.5-72B's struggle with lexical paraphrasing.
arXiv Detail & Related papers (2024-03-08T00:19:24Z) - Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy [48.29181662640212]
Factual inconsistencies pose a significant hurdle for faithful summarization by generative models.
We consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs.
arXiv Detail & Related papers (2024-02-20T08:41:23Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs)
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models.
Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data.
Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
arXiv Detail & Related papers (2023-10-10T10:22:05Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal language models' comprehensive grasp of language and their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.