Counterfactually Evaluating Explanations in Recommender Systems
- URL: http://arxiv.org/abs/2203.01310v1
- Date: Wed, 2 Mar 2022 18:55:29 GMT
- Title: Counterfactually Evaluating Explanations in Recommender Systems
- Authors: Yuanshun Yao and Chong Wang and Hang Li
- Abstract summary: We propose an offline evaluation method that can be computed without human involvement.
We show that, compared to conventional methods, our method can produce evaluation scores more correlated with the real human judgments.
- Score: 14.938252589829673
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern recommender systems face an increasing need to explain their
recommendations. Despite considerable progress in this area, evaluating the
quality of explanations remains a significant challenge for researchers and
practitioners. Prior work mainly conducts human studies to evaluate explanation
quality, which are usually expensive, time-consuming, and prone to human bias.
In this paper, we propose an offline evaluation method that can be computed
without human involvement. To evaluate an explanation, our method quantifies
its counterfactual impact on the recommendation. To validate the effectiveness
of our method, we carry out an online user study. We show that, compared to
conventional methods, our method can produce evaluation scores more correlated
with the real human judgments, and therefore can serve as a better proxy for
human evaluation. In addition, we show that explanations with high evaluation
scores are considered better by humans. Our findings highlight the promising
direction of using the counterfactual approach as one possible way to evaluate
recommendation explanations.
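The abstract does not spell out the exact formulation, but the core idea of quantifying an explanation's counterfactual impact can be sketched as: remove the parts of the user's history that the explanation cites, re-score the recommended item, and use the resulting score change as the explanation's evaluation score. The sketch below is illustrative only; the function names and signatures (score_fn, user_history, explanation_items) are assumptions, not the authors' API.

```python
# Minimal sketch of counterfactual-impact scoring for a recommendation
# explanation. Assumed interface, not the paper's actual implementation:
# score_fn(history, item) returns the model's recommendation score.

from typing import Callable, Iterable, Sequence


def counterfactual_impact(
    score_fn: Callable[[Sequence[str], str], float],  # model score for (history, item)
    user_history: Sequence[str],                       # items the user interacted with
    recommended_item: str,                             # the item being explained
    explanation_items: Iterable[str],                  # history items the explanation cites
) -> float:
    """Score an explanation by how much the recommendation depends on
    the history items it cites; a larger drop suggests a more faithful explanation."""
    cited = set(explanation_items)

    # Original recommendation score with the full user history.
    original = score_fn(user_history, recommended_item)

    # Counterfactual: remove the history items the explanation refers to.
    reduced_history = [x for x in user_history if x not in cited]
    counterfactual = score_fn(reduced_history, recommended_item)

    # The explanation's evaluation score is the resulting score change.
    return original - counterfactual
```

Under this sketch, an explanation that cites items whose removal barely changes the model's score would receive a low evaluation score, without any human annotation.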
Related papers
- Revisiting Reciprocal Recommender Systems: Metrics, Formulation, and Method [60.364834418531366]
We propose five new evaluation metrics that comprehensively and accurately assess the performance of RRS.
We formulate the RRS from a causal perspective, framing recommendations as bilateral interventions.
We introduce a reranking strategy to maximize matching outcomes, as measured by the proposed metrics.
arXiv Detail & Related papers (2024-08-19T07:21:02Z)
- Navigating the Evaluation Funnel to Optimize Iteration Speed for Recommender Systems [0.0]
We present a novel framework that simplifies the reasoning around the evaluation funnel for a recommendation system.
We show that decomposing the definition of success into smaller necessary criteria enables early identification of unsuccessful ideas.
We cover offline and online evaluation methods such as counterfactual logging, validation, verification, A/B testing, and interleaving.
arXiv Detail & Related papers (2024-04-03T17:15:45Z)
- Evaluation in Neural Style Transfer: A Review [0.7614628596146599]
We provide an in-depth analysis of existing evaluation techniques, identify the inconsistencies and limitations of current evaluation methods, and give recommendations for standardized evaluation practices.
We believe that the development of a robust evaluation framework will not only enable more meaningful and fairer comparisons but will also enhance the comprehension and interpretation of research findings in the field.
arXiv Detail & Related papers (2024-01-30T15:45:30Z)
- Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI [1.7222662622390634]
We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
arXiv Detail & Related papers (2023-07-31T09:20:16Z)
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
- Learning and Evaluating Human Preferences for Conversational Head Generation [101.89332968344102]
We propose a novel learning-based evaluation metric named Preference Score (PS) for fitting human preferences according to quantitative evaluations across different dimensions.
PS can serve as a quantitative evaluation without the need for human annotation.
arXiv Detail & Related papers (2023-07-20T07:04:16Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
- Doubting AI Predictions: Influence-Driven Second Opinion Recommendation [92.30805227803688]
We propose a way to augment human-AI collaboration by building on a common organizational practice: identifying experts who are likely to provide complementary opinions.
The proposed approach aims to leverage productive disagreement by identifying whether some experts are likely to disagree with an algorithmic assessment.
arXiv Detail & Related papers (2022-04-29T20:35:07Z)
- Measuring "Why" in Recommender Systems: a Comprehensive Survey on the Evaluation of Explainable Recommendation [87.82664566721917]
This survey is based on more than 100 papers from top-tier conferences like IJCAI, AAAI, TheWebConf, RecSys, UMAP, and IUI.
arXiv Detail & Related papers (2022-02-14T02:58:55Z)
- On the Interaction of Belief Bias and Explanations [4.211128681972148]
We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it.
We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of accounting for belief bias in evaluation.
arXiv Detail & Related papers (2021-06-29T12:49:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.