LLM-based Relevance Assessment for Web-Scale Search Evaluation at Pinterest
- URL: http://arxiv.org/abs/2509.03764v1
- Date: Wed, 03 Sep 2025 23:07:49 GMT
- Title: LLM-based Relevance Assessment for Web-Scale Search Evaluation at Pinterest
- Authors: Han Wang, Alex Whitworth, Pak Ming Cheung, Zhenjie Zhang, Krishna Kamath,
- Abstract summary: We present our approach at Pinterest Search to automate relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.
- Score: 3.306725465028306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Relevance evaluation plays a crucial role in personalized search systems to ensure that search results align with a user's queries and intent. While human annotation is the traditional method for relevance evaluation, its high cost and long turnaround time limit its scalability. In this work, we present our approach at Pinterest Search to automate relevance evaluation for online experiments using fine-tuned LLMs. We rigorously validate the alignment between LLM-generated judgments and human annotations, demonstrating that LLMs can provide reliable relevance measurement for experiments while greatly improving the evaluation efficiency. Leveraging LLM-based labeling further unlocks the opportunities to expand the query set, optimize sampling design, and efficiently assess a wider range of search experiences at scale. This approach leads to higher-quality relevance metrics and significantly reduces the Minimum Detectable Effect (MDE) in online experiment measurements.
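As a rough illustration of two quantities the abstract highlights, the sketch below (Python, with made-up labels, sample sizes, and variance; none of it comes from the paper) checks LLM-versus-human label agreement with a weighted kappa and shows how the Minimum Detectable Effect of a mean relevance metric shrinks when cheaper LLM labels allow a larger sample per experiment arm.

```python
# A minimal sketch (not Pinterest's actual pipeline) of two steps the abstract
# describes: (1) validating LLM judgments against human annotations, and
# (2) estimating the Minimum Detectable Effect (MDE) of a relevance metric.
# All numbers below are illustrative assumptions, not figures from the paper.
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

# (1) Alignment check: ordinal relevance labels (e.g., a 0-4 scale) from human
# annotators vs. a fine-tuned LLM on the same (query, result) pairs.
human_labels = np.array([4, 3, 0, 2, 4, 1, 3, 2])
llm_labels   = np.array([4, 3, 1, 2, 4, 1, 3, 3])
kappa = cohen_kappa_score(human_labels, llm_labels, weights="quadratic")
print(f"weighted kappa (LLM vs. human): {kappa:.3f}")

# (2) MDE for a two-sample comparison of a mean relevance metric: the smallest
# true difference detectable at significance alpha with the given power.
def mde(sigma: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.8) -> float:
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return (z_alpha + z_beta) * sigma * np.sqrt(2.0 / n_per_arm)

sigma = 0.9  # assumed std. dev. of the per-query relevance score
print(f"MDE with  2,000 labeled queries/arm: {mde(sigma,  2_000):.4f}")
print(f"MDE with 20,000 labeled queries/arm: {mde(sigma, 20_000):.4f}")  # cheaper labels -> larger sample -> smaller MDE
```

The MDE expression here is the standard two-sample normal approximation with equal arms; the paper's exact query set and sampling design may differ.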
Related papers
- When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment [29.603396943658428]
Large language models (LLMs) can be used as proxies for human judges. We show that models consistently assign inflated relevance scores to passages that do not genuinely satisfy the underlying information need. Experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues.
arXiv Detail & Related papers (2026-02-19T08:37:21Z)
- On Evaluating LLM Alignment by Evaluating LLMs as Judges [68.15541137648721]
Evaluating large language models' (LLMs) alignment requires them to be helpful, honest, safe, and to precisely follow human instructions. We examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. We propose a benchmark that assesses alignment without directly evaluating model outputs.
arXiv Detail & Related papers (2025-11-25T18:33:24Z)
- WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality [62.43165871914528]
We introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias.
arXiv Detail & Related papers (2025-10-21T12:16:04Z)
- Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems [10.227007419503297]
Large language models (LLMs) are increasingly revolutionizing evaluation methodologies across various human annotation tasks. We conduct a meta-evaluation on two datasets derived from real user studies in the e-commerce and movie domains. Our findings indicate that even the simplest zero-shot LLMs achieve parity with, or surpass, the performance of conventional metrics.
arXiv Detail & Related papers (2025-07-23T07:51:56Z)
- Leveraging LLMs to Evaluate Usefulness of Document [25.976948104719746]
We introduce a new user-centric evaluation framework that integrates users' search context and behavioral data into large language models. Our study demonstrates that when well-guided with context and behavioral information, LLMs can accurately evaluate usefulness. We also apply the labels produced by our method to predict user satisfaction, with real-world experiments indicating that these labels substantially improve the performance of satisfaction prediction models.
arXiv Detail & Related papers (2025-06-10T09:44:03Z)
- LLM-Driven Usefulness Judgment for Web Search Evaluation [12.10711284043516]
Evaluation is fundamental in optimizing search experiences and supporting diverse user intents in Information Retrieval (IR). Traditional search evaluation methods primarily rely on relevance labels, which assess how well retrieved documents match a user's query. In this paper, we explore an alternative approach: LLM-generated usefulness labels, which incorporate both implicit and explicit user behavior signals to evaluate document usefulness.
arXiv Detail & Related papers (2025-04-19T20:38:09Z)
- Utility-Focused LLM Annotation for Retrieval and Retrieval-Augmented Generation [96.18720164390699]
This paper explores the use of large language models (LLMs) for annotating document utility in training retrieval and retrieval-augmented generation (RAG) systems. Our results show that LLM-generated annotations enhance out-of-domain retrieval performance and improve RAG outcomes compared to models trained solely on human annotations or downstream QA metrics.
arXiv Detail & Related papers (2025-04-07T16:05:52Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators [48.54465599914978]
Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. LLMs still exhibit biases in evaluation and often struggle to generate coherent evaluations that align with human assessments. We introduce Pairwise-preference Search (PAIRS), an uncertainty-guided search-based rank aggregation method that employs LLMs to conduct pairwise comparisons locally and efficiently ranks candidate texts globally.
arXiv Detail & Related papers (2024-03-25T17:11:28Z)
- PiCO: Peer Review in LLMs based on the Consistency Optimization [48.48819141999387]
We use peer-review mechanisms to measure large language models (LLMs) automatically. We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores. We propose three metrics called PEN, CIN, and LIS to evaluate the gap in aligning human rankings.
arXiv Detail & Related papers (2024-02-02T18:49:26Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability (a hedged sketch of this kind of score follows the list below).
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
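For the last related paper, here is a minimal sketch of an Instruction-Following Difficulty (IFD)-style score, assuming it is the ratio of the model's perplexity on the answer conditioned on the instruction to its perplexity on the answer alone; the model name and helper function are illustrative and not taken from that paper.

```python
# Sketch of an IFD-style score under the assumption stated above:
# IFD(Q, A) ~= PPL(A | Q) / PPL(A). Higher values suggest the instruction gives
# the model little help in producing the answer, i.e., a harder sample.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_loss(prompt: str, answer: str) -> float:
    """Average cross-entropy of the answer tokens, optionally conditioned on a prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids if prompt else None
    answer_ids = tok(answer, return_tensors="pt").input_ids
    if prompt_ids is not None:
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        n_prompt = prompt_ids.shape[1]
    else:
        input_ids = answer_ids
        n_prompt = 0
    labels = input_ids.clone()
    labels[:, :n_prompt] = -100  # score only the answer tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean CE over answer tokens
    return loss.item()

def ifd(instruction: str, answer: str) -> float:
    return math.exp(answer_loss(instruction, answer)) / math.exp(answer_loss("", answer))

print(ifd("Translate to French: Good morning.", "Bonjour."))
```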