Recall, Robustness, and Lexicographic Evaluation
- URL: http://arxiv.org/abs/2302.11370v6
- Date: Sat, 30 Nov 2024 21:47:39 GMT
- Title: Recall, Robustness, and Lexicographic Evaluation
- Authors: Fernando Diaz, Michael D. Ekstrand, Bhaskar Mitra
- Abstract summary: The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure.
Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation.
- Score: 49.13362412522523
- Abstract: Although originally developed to evaluate sets of items, recall is often used to evaluate rankings of items, including those produced by recommender, retrieval, and other machine learning systems. The application of recall without a formal evaluative motivation has led to criticism of recall as a vague or inappropriate measure. In light of this debate, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define `recall-orientation' as the sensitivity of a metric to a user interested in finding every relevant item. Second, we analyze recall-orientation from the perspective of robustness with respect to possible content consumers and providers, connecting recall to recent conversations about fair ranking. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across three recommendation tasks and 17 information retrieval tasks, we establish that our new evaluation method, lexirecall, has convergent validity (i.e., it is correlated with existing recall metrics) and exhibits substantially higher sensitivity in terms of discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
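The abstract positions lexirecall as a preference-based evaluation built on lexicographic comparison of rankings with respect to a user who wants to find every relevant item. The Python sketch below illustrates one way such a lexicographic preference over two rankings could be computed; the function names, the deepest-rank-first comparison order, and the assumption that both rankings order the same corpus are illustrative choices for this sketch, not the paper's formal definition.

```python
from typing import Hashable, Sequence, Set


def relevant_ranks(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> list:
    """Return the 1-based positions of the relevant items in a ranking."""
    return [pos for pos, item in enumerate(ranking, start=1) if item in relevant]


def lexi_prefer(ranking_a, ranking_b, relevant) -> int:
    """
    Toy lexicographic preference between two rankings of the same corpus.

    Compares the positions of relevant items starting from the deepest
    (worst-ranked) one, so a ranking is preferred only if it surfaces the
    hardest-to-find relevant items earlier. Returns -1 if ranking_a is
    preferred, 1 if ranking_b is preferred, and 0 on a tie.
    Assumes both rankings are permutations of the same item set.
    """
    ranks_a = sorted(relevant_ranks(ranking_a, relevant), reverse=True)
    ranks_b = sorted(relevant_ranks(ranking_b, relevant), reverse=True)
    for ra, rb in zip(ranks_a, ranks_b):
        if ra != rb:
            return -1 if ra < rb else 1
    return 0


if __name__ == "__main__":
    relevant = {"d2", "d5"}
    run_a = ["d1", "d2", "d3", "d4", "d5"]  # last relevant item at rank 5
    run_b = ["d2", "d5", "d1", "d3", "d4"]  # last relevant item at rank 2
    print(lexi_prefer(run_a, run_b, relevant))  # 1 -> run_b is preferred
```

Because the comparison is a preference rather than a score, such a procedure can distinguish rankings that a single recall value at a fixed cutoff would treat as tied, which is consistent with the sensitivity claims in the abstract.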
Related papers
- Measuring the Robustness of Reference-Free Dialogue Evaluation Systems [12.332146893333952]
We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks.
We analyze metrics such as DialogRPT, UniEval, and PromptEval across grounded and ungrounded datasets.
arXiv Detail & Related papers (2025-01-12T06:41:52Z)
- MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation [15.64077949677469]
We present a novel Memory-Augmented Dialogue Benchmark (MADail-Bench) to evaluate the effectiveness of memory-augmented dialogue systems (MADS).
The benchmark assesses two tasks separately: memory retrieval and memory recognition with the incorporation of both passive and proactive memory recall data.
Results from cutting-edge embedding models and large language models on this benchmark indicate the potential for further advancement.
arXiv Detail & Related papers (2024-09-23T17:38:41Z)
- Ranking evaluation metrics from a group-theoretic perspective [5.333192842860574]
We show instances that result in inconsistent evaluations, which are sources of potential mistrust in commonly used metrics.
Our analysis sheds light on ranking evaluation metrics, highlighting that inconsistent evaluations should not be seen as a source of mistrust.
arXiv Detail & Related papers (2024-08-14T09:06:58Z)
- Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy [66.95501113584541]
Utility and topical relevance are critical measures in information retrieval.
We propose an Iterative utiliTy judgmEnt fraMework to promote each step of the cycle of Retrieval-Augmented Generation.
arXiv Detail & Related papers (2024-06-17T07:52:42Z)
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z)
- REV: Information-Theoretic Evaluation of Free-Text Rationales [83.24985872655738]
We argue that an ideal metric should focus on the new information uniquely provided in the rationale that is otherwise not provided in the input or the label.
We propose a metric called REV (Rationale Evaluation with conditional V-information) to quantify the amount of new, label-relevant information in a rationale.
arXiv Detail & Related papers (2022-10-10T19:31:30Z)
- A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy [60.419107377879925]
We propose a training-free and reference-free summarization evaluation metric.
Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score.
Our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
arXiv Detail & Related papers (2021-06-26T05:11:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.