Manifesto from Dagstuhl Perspectives Workshop 24352 -- Conversational Agents: A Framework for Evaluation (CAFE)
- URL: http://arxiv.org/abs/2506.11112v1
- Date: Sun, 08 Jun 2025 16:25:35 GMT
- Title: Manifesto from Dagstuhl Perspectives Workshop 24352 -- Conversational Agents: A Framework for Evaluation (CAFE)
- Authors: Christine Bauer, Li Chen, Nicola Ferro, Norbert Fuhr, Avishek Anand, Timo Breuer, Guglielmo Faggioli, Ophir Frieder, Hideo Joho, Jussi Karlgren, Johannes Kiesel, Bart P. Knijnenburg, Aldo Lipani, Lien Michiels, Andrea Papenmeier, Maria Soledad Pera, Mark Sanderson, Scott Sanner, Benno Stein, Johanne R. Trippas, Karin Verspoor, Martijn C. Willemsen
- Abstract summary: We defined the Conversational Agents Framework for Evaluation (CAFE) for the evaluation of CONIAC systems. CAFE consists of six major components: 1) goals of the system's stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: During the workshop, we discussed in depth what CONversational Information ACcess (CONIAC) is and what makes it unique, proposed a world model abstracting it, and defined the Conversational Agents Framework for Evaluation (CAFE) for evaluating CONIAC systems. CAFE consists of six major components: 1) goals of the system's stakeholders, 2) user tasks to be studied in the evaluation, 3) aspects of the users carrying out the tasks, 4) evaluation criteria to be considered, 5) evaluation methodology to be applied, and 6) measures for the quantitative criteria chosen.
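As a reading aid, the six CAFE components can be organized as a simple record. CAFE is a conceptual framework, not software; the class and field names below are hypothetical, chosen here only to mirror the six components listed in the abstract.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sketch only: CAFE is a conceptual framework from the manifesto;
# this class and all names in it are our own naming, not an API it defines.
@dataclass
class CAFEEvaluationPlan:
    """One evaluation design expressed along CAFE's six components."""
    stakeholder_goals: List[str]  # 1) goals of the system's stakeholders
    user_tasks: List[str]         # 2) user tasks to be studied in the evaluation
    user_aspects: List[str]       # 3) aspects of the users carrying out the tasks
    criteria: List[str]           # 4) evaluation criteria to be considered
    methodology: str              # 5) evaluation methodology to be applied
    measures: List[str]           # 6) measures for the quantitative criteria

# A hypothetical plan for evaluating a conversational search agent:
plan = CAFEEvaluationPlan(
    stakeholder_goals=["user satisfaction", "provider engagement"],
    user_tasks=["exploratory search"],
    user_aspects=["domain expertise"],
    criteria=["relevance", "efficiency"],
    methodology="controlled user study",
    measures=["nDCG", "task completion time"],
)
print(len(plan.criteria))  # 2
```

A structure like this makes explicit that a single evaluation commits to one methodology but may track several criteria and measures.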
Related papers
- AGACCI: Affiliated Grading Agents for Criteria-Centric Interface in Educational Coding Contexts
We introduce AGACCI, a multi-agent system that distributes specialized evaluation roles across collaborative agents. AGACCI outperforms a single GPT-based baseline in terms of rubric and feedback accuracy, relevance, consistency, and coherence.
arXiv Detail & Related papers (2025-07-07T15:50:46Z)
- Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z)
- SPHERE: An Evaluation Card for Human-AI Systems
We present an evaluation card, SPHERE, which encompasses five key dimensions. We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement.
arXiv Detail & Related papers (2025-03-24T20:17:20Z)
- Large Language Models as Evaluators for Conversational Recommender Systems: Benchmarking System Performance from a User-Centric Perspective
This study proposes an automated LLM-based CRS evaluation framework. It builds upon existing research in human-computer interaction and psychology. We use this framework to evaluate four different conversational recommender systems.
arXiv Detail & Related papers (2025-01-16T12:06:56Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric Factors
We propose a new and inclusive evaluation protocol, Concept, which integrates both system- and user-centric factors.
Our protocol, Concept, serves a dual purpose. First, it provides an overview of the pros and cons in current CRS models.
Second, it pinpoints the problem of low usability in the "omnipotent" ChatGPT and offers a comprehensive reference guide for evaluating CRS.
arXiv Detail & Related papers (2024-04-04T08:56:48Z)
- Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations
This review provides an overview of the used constructs and metrics in previous work.
It also discusses challenges in the context of dialogue system evaluation.
It develops a research agenda for the future of dialogue system evaluation.
arXiv Detail & Related papers (2023-12-21T14:15:46Z)
- The Value-Sensitive Conversational Agent Co-Design Framework
This paper presents the Value-Sensitive Conversational Agent (VSCA) Framework for enabling the collaborative design (co-design) of value-sensitive CAs with relevant stakeholders.
The framework facilitates the co-design of three artefacts that elicit stakeholder values and provide technical utility to CA teams, guiding CA implementation.
arXiv Detail & Related papers (2023-10-18T09:58:39Z)
- Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.