Can LLM Annotations Replace User Clicks for Learning to Rank?
- URL: http://arxiv.org/abs/2511.06635v1
- Date: Mon, 10 Nov 2025 02:26:14 GMT
- Title: Can LLM Annotations Replace User Clicks for Learning to Rank?
- Authors: Lulu Yu, Keping Bi, Jiafeng Guo, Shihao Liu, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng,
- Abstract summary: Large-scale supervised data is essential for training modern ranking models, but obtaining high-quality human annotations is costly.<n>Click data has been widely used as a low-cost alternative, and with recent advances in large language models (LLMs), LLM-based relevance annotation has emerged as another promising annotation.<n> Experiments on both a public dataset, TianGong-ST, and an industrial dataset, Baidu-Click, show that click-supervised models perform better on high-frequency queries.<n>We explore two training strategies -- data scheduling and frequency-aware multi-objective learning -- that integrate both supervision signals.
- Score: 112.2254432364736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale supervised data is essential for training modern ranking models, but obtaining high-quality human annotations is costly. Click data has been widely used as a low-cost alternative, and with recent advances in large language models (LLMs), LLM-based relevance annotation has emerged as another promising annotation. This paper investigates whether LLM annotations can replace click data for learning to rank (LTR) by conducting a comprehensive comparison across multiple dimensions. Experiments on both a public dataset, TianGong-ST, and an industrial dataset, Baidu-Click, show that click-supervised models perform better on high-frequency queries, while LLM annotation-supervised models are more effective on medium- and low-frequency queries. Further analysis shows that click-supervised models are better at capturing document-level signals such as authority or quality, while LLM annotation-supervised models are more effective at modeling semantic matching between queries and documents and at distinguishing relevant from non-relevant documents. Motivated by these observations, we explore two training strategies -- data scheduling and frequency-aware multi-objective learning -- that integrate both supervision signals. Both approaches enhance ranking performance across queries at all frequency levels, with the latter being more effective. Our code is available at https://github.com/Trustworthy-Information-Access/LLMAnn_Click.
Related papers
- LLMs as Data Annotators: How Close Are We to Human Performance [47.61698665650761]
Manual annotation of data is labor-intensive, time-consuming, and costly.<n>In-context learning (ICL) in which some examples related to the task are given in the prompt can lead to inefficiencies and suboptimal model performance.<n>This paper presents experiments comparing several LLMs, considering different embedding models, across various datasets for the Named Entity Recognition (NER) task.
arXiv Detail & Related papers (2025-04-21T11:11:07Z) - Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation [96.18720164390699]
This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems.<n>Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics.
arXiv Detail & Related papers (2025-04-07T16:05:52Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Large Language Model-guided Document Selection [23.673690115025913]
Large Language Model (LLM) pre-training exhausts an ever growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
arXiv Detail & Related papers (2024-06-07T04:52:46Z) - LLMaAA: Making Large Language Models as Active Annotators [32.57011151031332]
We propose LLMaAA, which takes large language models as annotators and puts them into an active learning loop to determine what to annotate efficiently.
We conduct experiments and analysis on two classic NLP tasks, named entity recognition and relation extraction.
With LLMaAA, task-specific models trained from LLM-generated labels can outperform the teacher within only hundreds of annotated examples.
arXiv Detail & Related papers (2023-10-30T14:54:15Z) - LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the SotA when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z) - MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z) - From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.