SPHERE: An Evaluation Card for Human-AI Systems
- URL: http://arxiv.org/abs/2504.07971v1
- Date: Mon, 24 Mar 2025 20:17:20 GMT
- Title: SPHERE: An Evaluation Card for Human-AI Systems
- Authors: Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu,
- Abstract summary: We present an evaluation card SPHERE, which encompasses five key dimensions. We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement.
- Score: 75.0887588648484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion on human-AI system evaluation design options, we present an evaluation card SPHERE, which encompasses five key dimensions: 1) What is being evaluated?; 2) How is the evaluation conducted?; 3) Who is participating in the evaluation?; 4) When is evaluation conducted?; 5) How is evaluation validated? We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement. We provide three recommendations for improving the validity and rigor of evaluation practices.
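The five dimensions amount to a structured documentation schema. As a minimal sketch (not the authors' artifact), one could record a SPHERE-style card as plain structured data; the field names below mirror the five questions in the abstract, and the example values are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SphereCard:
    """One hypothetical way to record a SPHERE-style evaluation card."""
    what: list[str]        # What is being evaluated?
    how: list[str]         # How is the evaluation conducted?
    who: list[str]         # Who is participating in the evaluation?
    when: list[str]        # When is evaluation conducted?
    validation: list[str]  # How is evaluation validated?

card = SphereCard(
    what=["response helpfulness", "user trust"],
    how=["7-point Likert ratings", "within-subjects comparison"],
    who=["12 domain experts", "200 crowdworkers"],
    when=["formative pilot", "summative study after deployment"],
    validation=["inter-rater agreement (Krippendorff's alpha)"],
)
print(card)
```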
Related papers
- An Empirical Study of Evaluating Long-form Question Answering [77.8023489322551]
We collect 5,236 factoid and non-factoid long-form answers generated by different large language models.
We conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness.
We find that the style, length of the answers, and the category of questions can bias the automatic evaluation metrics.
arXiv Detail & Related papers (2025-04-25T15:14:25Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z) - Navigating the Evaluation Funnel to Optimize Iteration Speed for Recommender Systems [0.0]
We present a novel framework that simplifies the reasoning around the evaluation funnel for a recommendation system.
We show that decomposing the definition of success into smaller necessary criteria for success enables early identification of non-successful ideas.
We go through so-called offline and online evaluation methods such as counterfactual logging, validation, verification, A/B testing, and interleaving.
arXiv Detail & Related papers (2024-04-03T17:15:45Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Counterfactually Evaluating Explanations in Recommender Systems [14.938252589829673]
We propose an offline evaluation method that can be computed without human involvement.
We show that, compared to conventional methods, our method can produce evaluation scores more correlated with the real human judgments.
arXiv Detail & Related papers (2022-03-02T18:55:29Z) - From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI [3.7592122147132776]
We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation.
We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users.
This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods.
arXiv Detail & Related papers (2022-01-20T13:23:20Z) - How to Evaluate Your Dialogue Models: A Review of Approaches [2.7834038784275403]
We are the first to divide evaluation methods into three classes: automatic evaluation, human-involved evaluation, and user simulator-based evaluation.
The availability of benchmarks suitable for evaluating dialogue techniques is also discussed in detail.
arXiv Detail & Related papers (2021-08-03T08:52:33Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation (a generic off-policy evaluation sketch follows this list).
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols [17.14709845342071]
Various evaluation protocols are currently used to assess chat-oriented dialogue management systems.
This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems.
arXiv Detail & Related papers (2020-06-10T23:29:05Z)
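For readers unfamiliar with off-policy evaluation (referenced in the ENIGMA entry above), below is a minimal importance-sampling sketch, not the ENIGMA estimator itself, illustrating how a target policy can be scored from pre-collected, human-rated experience without any new human interaction. All data, policies, and names here are synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4

# Pre-collected experience: each logged turn records the action chosen by the
# behaviour (logging) policy, its propensity, and the human score it received.
logged_actions = rng.integers(0, n_actions, size=1000)
behaviour_probs = np.full(1000, 1.0 / n_actions)                 # uniform logging policy
human_scores = rng.normal(loc=0.5 * logged_actions, scale=1.0)   # synthetic ratings

# Hypothetical target policy to be evaluated (prefers higher-indexed actions).
target_policy = np.array([0.1, 0.2, 0.3, 0.4])

# Inverse-propensity-scoring (importance sampling) estimate of the target
# policy's expected human score, computed without collecting new ratings.
weights = target_policy[logged_actions] / behaviour_probs
ips_estimate = float(np.mean(weights * human_scores))
print(f"Estimated human score under the target policy: {ips_estimate:.3f}")
```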