LLM-Driven Usefulness Labeling for IR Evaluation
- URL: http://arxiv.org/abs/2503.08965v1
- Date: Wed, 12 Mar 2025 00:07:39 GMT
- Title: LLM-Driven Usefulness Labeling for IR Evaluation
- Authors: Mouly Dewan, Jiqun Liu, Chirag Shah
- Abstract summary: This study focuses on LLM-generated usefulness labels, a crucial evaluation metric that considers the user's search intents and task objectives. Our experiment utilizes task-level, query-level, and document-level features along with user search behavior signals, which are essential in defining the usefulness of a document.
- Score: 13.22615100911924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the information retrieval (IR) domain, evaluation plays a crucial role in optimizing search experiences and supporting diverse user intents. In the recent LLM era, research has sought to automate document relevance labeling, as these labels have traditionally been assigned by crowd-sourced workers, a process that is both time-consuming and costly. This study focuses on LLM-generated usefulness labels, a crucial evaluation metric that considers the user's search intents and task objectives, aspects where relevance alone falls short. Our experiment utilizes task-level, query-level, and document-level features along with user search behavior signals, which are essential in defining the usefulness of a document. Our research finds that (i) pre-trained LLMs can generate moderate usefulness labels by understanding the comprehensive search task session, and (ii) pre-trained LLMs render better judgments in short search sessions when provided with session context. Additionally, we investigate whether LLMs can capture the divergence between relevance and usefulness, and we conduct an ablation study to identify the features most critical for accurate usefulness label generation. In conclusion, this work explores LLM-generated usefulness labels by evaluating critical metrics and optimizing for practicality in real-world settings.
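The paper itself ships no code, but a minimal sketch of what a single labeling call might look like is given below, assuming an OpenAI-style chat API. The prompt wording, the 0-3 grade scale, the model name, and the behavior-signal fields (`clicked`, `dwell_time_s`) are illustrative assumptions, not the authors' actual experimental setup.

```python
# Illustrative sketch only: the prompt, 0-3 scale, feature names, and
# model choice are assumptions, not the paper's experimental setup.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def usefulness_label(task: str, query: str, document: str,
                     clicked: bool, dwell_time_s: float) -> int:
    """Ask an LLM for a 0-3 usefulness grade given session context."""
    prompt = (
        "Judge how USEFUL this document is for the searcher's task, "
        "not merely how topically relevant it is to the query.\n"
        f"Search task: {task}\n"
        f"Query: {query}\n"
        f"Document: {document}\n"
        f"Behavior signals: clicked={clicked}, dwell_time={dwell_time_s:.0f}s\n"
        "Reply with one integer: 0 (useless) to 3 (highly useful)."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```

In practice such calls would be batched over whole session logs and the returned grades compared against human usefulness judgments.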
Related papers
- LLM-Driven Usefulness Judgment for Web Search Evaluation [12.10711284043516]
Evaluation is fundamental in optimizing search experiences and supporting diverse user intents in Information Retrieval (IR).
Traditional search evaluation methods primarily rely on relevance labels, which assess how well retrieved documents match a user's query.
In this paper, we explore an alternative approach: LLM-generated usefulness labels, which incorporate both implicit and explicit user behavior signals to evaluate document usefulness.
arXiv Detail & Related papers (2025-04-19T20:38:09Z)
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.
We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.
Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
- How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks.
We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z)
- Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG [69.51637252264277]
We investigate whether Large Language Models (LLMs) can effectively replace human annotations in training retrieval models.
Our experiments show that retrievers trained on utility-focused annotations significantly outperform those trained on human annotations in the out-of-domain setting.
With just 20% of the human-annotated data, retrievers trained with utility-focused annotations match the performance of models trained entirely on human annotations.
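As a rough illustration of that mixing recipe (the function and data layout below are invented, not the paper's pipeline), one might keep every LLM utility-focused label and sample only a fifth of the human pool:

```python
# Hypothetical sketch of the mixing idea: keep all LLM utility-focused
# labels and add only a small sample of the human-annotated pairs.
import random

def mix_annotations(llm_labeled: list, human_labeled: list,
                    human_fraction: float = 0.2, seed: int = 0) -> list:
    """Combine LLM-labeled pairs with a fraction of the human labels."""
    rng = random.Random(seed)
    k = int(len(human_labeled) * human_fraction)
    mixed = llm_labeled + rng.sample(human_labeled, k)
    rng.shuffle(mixed)  # avoid ordering effects during training
    return mixed
```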
arXiv Detail & Related papers (2025-04-07T16:05:52Z)
- Automated Query-Product Relevance Labeling using Large Language Models for E-commerce Search [3.392843594990172]
Traditional approaches for annotating query-product pairs rely on human-based labeling services.
We show that Large Language Models (LLMs) can approach human-level accuracy on this task in a fraction of the time and cost required by human labelers.
This scalable alternative to human annotation has significant implications for information retrieval domains.
arXiv Detail & Related papers (2025-02-21T22:59:36Z)
- Latent Factor Models Meets Instructions: Goal-conditioned Latent Factor Discovery without Task Supervision [50.45597801390757]
Instruct-LF is a goal-oriented latent factor discovery system.
It integrates instruction-following ability with statistical models to handle noisy datasets.
arXiv Detail & Related papers (2025-02-21T02:03:08Z)
- From Selection to Generation: A Survey of LLM-based Active Learning [153.8110509961261]
Large Language Models (LLMs) have been employed for generating entirely new data instances and providing more cost-effective annotations.
This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based active learning (AL) techniques.
arXiv Detail & Related papers (2025-02-17T12:58:17Z)
- From Human Annotation to LLMs: SILICON Annotation Workflow for Management Research [13.818244562506138]
Large Language Models (LLMs) provide a cost-effective and efficient alternative to human annotation.
This paper introduces the "SILICON" (Systematic Inference with LLMs for Information Classification and Notation) workflow.
The workflow integrates established principles of human annotation with systematic prompt optimization and model selection.
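As a loose sketch of what "systematic prompt optimization" can mean in such a workflow (not SILICON's actual procedure; every name below is hypothetical), candidate prompts can be scored by agreement with a small human-labeled validation set:

```python
# Hypothetical sketch: pick the candidate prompt whose LLM labels agree
# best with a small human-labeled validation set. `label_fn` stands in
# for whatever sends (prompt, item) to the LLM and parses a label back.
from typing import Callable, Iterable, List, Tuple

def select_prompt(candidates: Iterable[str],
                  validation: Iterable[Tuple[str, str]],
                  label_fn: Callable[[str, str], str]) -> str:
    pairs: List[Tuple[str, str]] = list(validation)

    def accuracy(prompt: str) -> float:
        hits = sum(label_fn(prompt, item) == gold for item, gold in pairs)
        return hits / len(pairs)

    # Return the prompt with the highest validation agreement.
    return max(candidates, key=accuracy)
```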
arXiv Detail & Related papers (2024-12-19T02:21:41Z)
- Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting [23.61061000692023]
This study leverages user interactions recorded in search logs to yield insights into users' implicit search intentions.
We propose ProRBP, a novel Progressive Retrieved Behavior-augmented Prompting framework for integrating search scenario-oriented knowledge with Large Language Models.
arXiv Detail & Related papers (2024-08-18T11:07:38Z)
- LLMJudge: LLMs for Relevance Judgments [37.103230004631996]
Recent studies have shown that LLMs can generate reliable relevance judgments for search systems.
The LLMJudge challenge is organized as part of the LLM4Eval workshop at SIGIR 2024.
The collected data will be released as a package to support automatic relevance judgment research.
arXiv Detail & Related papers (2024-08-09T23:15:41Z)
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs) on identifying and clarifying ambiguous user queries, built on a well-organized taxonomy of ambiguity types.
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
- The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions [114.67699010359637]
We analyze a large-scale collection of real user queries to GPT.
We find that tasks such as "design" and "planning" are prevalent in user interactions but are largely neglected by traditional NLP benchmarks or differ from them.
arXiv Detail & Related papers (2023-10-19T02:12:17Z)
- Synergistic Interplay between Search and Large Language Models for Information Retrieval [141.18083677333848]
InteR allows retrieval models (RMs) to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
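A minimal sketch of the general idea follows (not InteR's actual iterative procedure): have the LLM draft a short knowledge passage, append it to the query, and score documents with a lexical retriever. The model name, the prompt, and the third-party rank_bm25 package are assumptions for illustration.

```python
# Rough sketch of LLM-assisted query expansion, not InteR's exact method.
from openai import OpenAI
from rank_bm25 import BM25Okapi  # pip install rank-bm25

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def expand_query(query: str) -> str:
    """Append a short LLM-written background passage to the query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user",
                   "content": f"Write a short factual passage about: {query}"}],
        temperature=0,
    )
    return query + " " + resp.choices[0].message.content

# Toy corpus; in practice this is the full document collection.
corpus = ["caffeine delays sleep onset in most adults",
          "green tea contains less caffeine than coffee"]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
scores = bm25.get_scores(expand_query("caffeine and sleep").lower().split())
```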
arXiv Detail & Related papers (2023-05-12T11:58:15Z)