Related papers: To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay

URL: http://arxiv.org/abs/2505.04209v2
Date: Thu, 29 May 2025 05:39:34 GMT
Title: To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay
Authors: Soumik Dey, Hansi Wu, Binbin Li,
Abstract summary: E-commerce sellers are recommended keyphrases based on their inventory to increase buyer engagement (clicks/sales). The relevance of advertiser keyphrases plays an important role in preventing the inundation of search systems with numerous irrelevant items. This study discusses the practicalities of using human judgment via a case study at eBay Advertising.
Score: 1.7058804466282262
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: E-commerce sellers are recommended keyphrases based on their inventory on which they advertise to increase buyer engagement (clicks/sales). The relevance of advertiser keyphrases plays an important role in preventing the inundation of search systems with numerous irrelevant items that compete for attention in auctions, in addition to maintaining a healthy seller perception. In this work, we describe the shortcomings of training Advertiser keyphrase relevance filter models on click/sales/search relevance signals and the importance of aligning with human judgment, as sellers have the power to adopt or reject said keyphrase recommendations. In this study, we frame Advertiser keyphrase relevance as a complex interaction between three dynamical systems -- seller judgment, which influences seller adoption of our product; Advertising, which provides the keyphrases to bid on; and Search, which holds the auctions for the same keyphrases. This study discusses the practicalities of using human judgment via a case study at eBay Advertising and demonstrates that using LLM-as-a-judge en masse as a scalable proxy for seller judgment to train our relevance models achieves a better harmony across the three systems -- provided that they are bound by a meticulous evaluation framework grounded in business metrics.

Related papers

LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations at eBay [1.4555205338313157]
This study introduces a novel two-step LLM distillation process from an LLM-judge used to debias our Embedding Based Retrieval (EBR) model from click-data. We distill from an LLM teacher via a cross-encoder assistant into a bi-encoder student using a multi-task training approach, ultimately employing the student bi-encoder to retrieve relevant advertiser keyphrases.
arXiv Detail & Related papers (2025-08-05T16:47:17Z)
Multi-objective Aligned Bidword Generation Model for E-commerce Search Advertising [16.8420671443003]
Retrieval systems primarily address the challenge of matching user queries with the most relevant advertisements. We propose a Multi-objective aligned Bidword Generation Model (MoBGM), which is composed of a discriminator, generator, and preference alignment module. Our proposed algorithm significantly outperforms the state of the art in offline and online experiments.
arXiv Detail & Related papers (2025-06-04T10:57:18Z)
Middleman Bias in Advertising: Aligning Relevance of Keyphrase Recommendations with Search [4.275764895529604]
We describe the shortcomings of training relevance filter models on biased click/sales signals. We re-conceptualize advertiser keyphrase relevance as interaction between two dynamical systems. We discuss the bias of search relevance systems and the need to align advertiser keyphrases with search relevance signals.
arXiv Detail & Related papers (2025-01-31T19:28:26Z)
JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking [81.88787401178378]
We introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability.
arXiv Detail & Related papers (2024-10-31T18:43:12Z)
GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation [3.167259972777881]
GraphEx is an innovative graph-based approach that recommends keyphrases to sellers using extraction of token permutations from item titles. It supports near real-time inferencing in resource-constrained production environments and scales effectively for billions of items.
arXiv Detail & Related papers (2024-09-05T00:25:37Z)
Advancing Ad Auction Realism: Practical Insights & Modeling Implications [2.8413290300628313]
This paper shows that one can still gain useful insight into modern ad auctions by modeling advertisers as agents governed by an adversarial bandit algorithm. We find that soft floors yield lower revenues than suitably chosen reserve prices, even restricting attention to a single query.
arXiv Detail & Related papers (2023-07-21T17:45:28Z)
Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
Large language models (LLMs) claim that they can assist with relevance judgments. It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z)
KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility. Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
Hierarchical Conversational Preference Elicitation with Bandit Feedback [36.507341041113825]
We formulate a new conversational bandit problem that allows the recommender system to choose either a key-term or an item to recommend at each round. We conduct a survey and analyze a real-world dataset to find that, unlike assumptions made in prior works, key-term rewards are mainly affected by rewards of representative items. We propose two bandit algorithms, Hier-UCB and Hier-LinUCB, that leverage this observed relationship and the hierarchical structure between key-terms and items.
arXiv Detail & Related papers (2022-09-06T05:35:24Z)
A novel auction system for selecting advertisements in Real-Time bidding [68.8204255655161]
Real-Time Bidding is a new Internet advertising system that has become very popular in recent years. We propose an alternative betting system with a new approach that not only considers the economic aspect but also other relevant factors for the functioning of the advertising system.
arXiv Detail & Related papers (2020-10-22T18:36:41Z)
Examining the Ordering of Rhetorical Strategies in Persuasive Requests [58.63432866432461]
We use a Variational Autoencoder model to disentangle content and rhetorical strategies in textual requests from a large-scale loan request corpus. We find that specific (orderings of) strategies interact uniquely with a request's content to impact success rate, and thus the persuasiveness of a request.
arXiv Detail & Related papers (2020-10-09T15:10:44Z)
Learning to Infer User Hidden States for Online Sequential Advertising [52.169666997331724]
We propose our Deep Intents Sequential Advertising (DISA) method to address these issues. The key part of interpretability is to understand a consumer's purchase intent which is, however, unobservable (called hidden states).
arXiv Detail & Related papers (2020-09-03T05:12:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.