PerSEval: Assessing Personalization in Text Summarizers
- URL: http://arxiv.org/abs/2407.00453v2
- Date: Fri, 25 Oct 2024 04:36:23 GMT
- Title: PerSEval: Assessing Personalization in Text Summarizers
- Authors: Sourish Dasgupta, Ankush Chander, Parth Borad, Isha Motiyani, Tanmoy Chakraborty
- Abstract summary: We argue that accuracy measures are inadequate for evaluating the degree of personalization of personalized text summaries.
We propose PerSEval, a novel measure that satisfies the required sufficiency condition.
- Score: 14.231110627461
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Personalized summarization models cater to individuals' subjective understanding of saliency, as represented by their reading history and current topics of attention. Existing personalized text summarizers are primarily evaluated based on accuracy measures such as BLEU, ROUGE, and METEOR. However, a recent study argued that accuracy measures are inadequate for evaluating the degree of personalization of these models and proposed EGISES, the first metric to evaluate personalized text summaries. It was suggested that accuracy is a separate aspect and should be evaluated standalone. In this paper, we challenge the necessity of an accuracy leaderboard, suggesting that relying on accuracy-based aggregated results might lead to misleading conclusions. To support this, we delve deeper into EGISES, demonstrating both theoretically and empirically that it measures the degree of responsiveness, a necessary but not sufficient condition for degree-of-personalization. We subsequently propose PerSEval, a novel measure that satisfies the required sufficiency condition. Based on the benchmarking of ten SOTA summarization models on the PENS dataset, we empirically establish that -- (i) PerSEval is reliable w.r.t human-judgment correlation (Pearson's r = 0.73; Spearman's $\rho$ = 0.62; Kendall's $\tau$ = 0.42), (ii) PerSEval has high rank-stability, (iii) PerSEval as a rank-measure is not entailed by EGISES-based ranking, and (iv) PerSEval can be a standalone rank-measure without the need of any aggregated ranking.
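The abstract reports PerSEval's human-judgment correlation via three standard statistics (Pearson's r, Spearman's ρ, Kendall's τ). As a minimal sketch of how such correlations are computed, the snippet below implements all three from scratch on illustrative score lists; the sample data is hypothetical, not from the paper.

```python
# Sketch: the three correlation statistics used to validate PerSEval
# against human judgments. Pure-stdlib implementations; sample inputs
# are illustrative only.
import math

def pearson_r(x, y):
    # Linear correlation between raw scores.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    # Average (fractional) ranks, 1-based; ties share the mean rank.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    # Spearman's rho is Pearson's r applied to the ranks.
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    # Tau-a: (concordant - discordant) pairs over n(n-1)/2, no tie correction.
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            p = (x[i] - x[j]) * (y[i] - y[j])
            s += (p > 0) - (p < 0)
    return 2 * s / (n * (n - 1))
```

Since Spearman and Kendall depend only on rank order, they are the natural checks for a rank-measure like PerSEval, while Pearson captures linear agreement with human scores.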
Related papers
- Generalized Leverage Score for Scalable Assessment of Privacy Vulnerability [6.029433950934382]
We show that exposure to membership inference attack (MIA) is governed by a data point's influence on the learned model. We formalize this in the linear setting by establishing a theoretical correspondence between individual MIA risk and the leverage score. This characterization explains how data-dependent sensitivity translates into exposure, without the computational burden of training shadow models.
arXiv Detail & Related papers (2026-02-17T07:07:31Z) - CCE: Confidence-Consistency Evaluation for Time Series Anomaly Detection [56.302586730134806]
We introduce Confidence-Consistency Evaluation (CCE), a novel evaluation metric. CCE simultaneously measures prediction confidence and uncertainty consistency. We also establish RankEval, a benchmark for comparing the ranking capabilities of various metrics.
arXiv Detail & Related papers (2025-09-01T03:38:38Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs).
We derive novel metrics with high-probability guarantees concerning the output distribution of a model.
Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
arXiv Detail & Related papers (2024-10-04T15:44:23Z) - Semi-supervised Learning For Robust Speech Evaluation [30.593420641501968]
Speech evaluation measures a learner's oral proficiency using automatic models.
This paper proposes to address such challenges by exploiting semi-supervised pre-training and objective regularization.
An anchor model is trained using pseudo labels to predict the correctness of pronunciation.
arXiv Detail & Related papers (2024-09-23T02:11:24Z) - TeLeS: Temporal Lexeme Similarity Score to Estimate Confidence in End-to-End ASR [1.8477401359673709]
Class-probability-based confidence scores do not accurately represent quality of overconfident ASR predictions.
We propose a novel Temporal-Lexeme Similarity (TeLeS) confidence score to train a Confidence Estimation Model (CEM).
We conduct experiments with ASR models trained in three languages, namely Hindi, Tamil, and Kannada, with varying training data sizes.
arXiv Detail & Related papers (2024-01-06T16:29:13Z) - Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
arXiv Detail & Related papers (2023-08-28T03:03:03Z) - TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models [9.048102020202817]
Topological Precision and Recall (TopP&R) provides a systematic approach to estimating supports.
We show that TopP&R is robust to outliers and non-independent and identically distributed (Non-IID) perturbations.
This is the first evaluation metric focused on the robust estimation of the support and provides its statistical consistency under noise.
arXiv Detail & Related papers (2023-06-13T11:46:00Z) - An Effective Meaningful Way to Evaluate Survival Models [34.21432603301076]
In practice, the test set includes (right) censored individuals, meaning we do not know when a censored individual actually experienced the event.
We introduce a novel and effective approach for generating realistic semi-synthetic survival datasets.
Our proposed metric is able to rank models accurately based on their performance, and often closely matches the true MAE.
arXiv Detail & Related papers (2023-06-01T23:22:46Z) - Ambiguity Meets Uncertainty: Investigating Uncertainty Estimation for Word Sense Disambiguation [5.55197751179213]
Existing supervised methods treat WSD as a classification task and have achieved remarkable performance.
This paper extensively studies uncertainty estimation (UE) on the benchmark designed for WSD.
Using the selected UE score on well-designed test scenarios, we examine the model's capability to capture data and model uncertainties, and find that it reflects data uncertainty satisfactorily but underestimates model uncertainty.
arXiv Detail & Related papers (2023-05-22T15:18:15Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
arXiv Detail & Related papers (2022-03-16T15:00:33Z) - Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.