Design and evaluation of AI copilots -- case studies of retail copilot templates
- URL: http://arxiv.org/abs/2407.09512v1
- Date: Mon, 17 Jun 2024 17:31:33 GMT
- Title: Design and evaluation of AI copilots -- case studies of retail copilot templates
- Authors: Michal Furmakiewicz, Chang Liu, Angus Taylor, Ilya Venger
- Abstract summary: Building a successful AI copilot requires a systematic approach.
This paper is divided into two sections, covering the design and evaluation of a copilot respectively.
- Score: 2.7274834772504954
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Building a successful AI copilot requires a systematic approach. This paper is divided into two sections, covering the design and evaluation of a copilot, respectively. A case study of developing copilot templates for the retail domain by Microsoft is used to illustrate the role and importance of each aspect. The first section explores the key technical components of a copilot's architecture, including the LLM, plugins for knowledge retrieval and actions, orchestration, system prompts, and responsible AI guardrails. The second section discusses testing and evaluation as a principled way to promote desired outcomes and manage unintended consequences when using AI in a business context. We discuss how to measure and improve a copilot's quality and safety through the lens of an end-to-end human-AI decision loop framework. By providing insights into the anatomy of a copilot and the critical aspects of testing and evaluation, this paper provides concrete evidence that good design and evaluation practices are essential for building effective, human-centered AI assistants.
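The architecture the abstract names (LLM, retrieval and action plugins, orchestration, system prompt, responsible AI guardrails) can be pictured with a minimal Python sketch. All names below (`Plugin`, `guardrail_check`, `call_llm`, `orchestrate`) are hypothetical illustrations of those components, not the paper's or any Microsoft template's actual API, and the LLM call is stubbed out.

```python
# Minimal sketch of the copilot anatomy described in the abstract:
# system prompt, plugins for retrieval/actions, orchestration, and
# responsible-AI guardrails. All names are hypothetical illustrations.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Plugin:
    """A knowledge-retrieval or action plugin the orchestrator can call."""
    name: str
    description: str
    run: Callable[[str], str]


SYSTEM_PROMPT = (
    "You are a retail copilot. Ground answers in retrieved data, "
    "cite the plugin used, and refuse requests outside retail scenarios."
)


def guardrail_check(text: str, blocked_terms: List[str]) -> bool:
    """Toy responsible-AI filter: reject text containing blocked terms."""
    lowered = text.lower()
    return not any(term in lowered for term in blocked_terms)


def call_llm(system_prompt: str, user_query: str, context: str) -> str:
    """Stand-in for an LLM call; a real copilot would invoke a hosted model here."""
    return f"[answer to '{user_query}' grounded in: {context or 'no context'}]"


def orchestrate(query: str, plugins: Dict[str, Plugin]) -> str:
    """One turn of the decision loop: input guardrail -> plugin -> LLM -> output guardrail."""
    if not guardrail_check(query, blocked_terms=["credit card number"]):
        return "Request declined by input guardrail."

    # Naive plugin selection by keyword; production systems let the LLM plan this.
    context = ""
    for plugin in plugins.values():
        if plugin.name in query.lower():
            context = plugin.run(query)
            break

    answer = call_llm(SYSTEM_PROMPT, query, context)
    if not guardrail_check(answer, blocked_terms=["credit card number"]):
        return "Response withheld by output guardrail."
    return answer


if __name__ == "__main__":
    plugins = {
        "inventory": Plugin(
            name="inventory",
            description="Looks up stock levels for a product.",
            run=lambda q: "inventory: 42 units of SKU-123 in the Seattle store",
        ),
    }
    print(orchestrate("How much inventory do we have for SKU-123?", plugins))
```

Running the sketch routes the query through the input guardrail, a keyword-matched inventory plugin, the stubbed LLM, and the output guardrail, mirroring the single-turn decision loop the abstract outlines; a production copilot would replace the keyword routing with LLM-driven planning and the stub with a hosted model call.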
Related papers
- A Human Centric Requirements Engineering Framework for Assessing Github Copilot Output [0.0]
GitHub Copilot introduces new challenges in how such software tools address human needs. I analyzed GitHub Copilot's interaction with users through its chat interface. I established a human-centered requirements framework with clear metrics to evaluate these qualities.
arXiv Detail & Related papers (2025-08-05T21:33:23Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration [79.69935257008467]
We introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities. We conduct the first large-scale human study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations' influence on human understanding.
arXiv Detail & Related papers (2025-06-05T20:48:16Z) - From Coders to Critics: Empowering Students through Peer Assessment in the Age of AI Copilots [3.3094795918443634]
This paper presents an empirical study of a rubric-based, anonymized peer review process implemented in a large programming course. Students evaluated each other's final projects (a 2D game), and their assessments were compared to instructor grades using correlation, mean absolute error, and root mean square error (RMSE). Results show that peer review can approximate instructor evaluation with moderate accuracy and foster student engagement, evaluative thinking, and interest in providing good feedback to their peers.
arXiv Detail & Related papers (2025-05-28T08:17:05Z) - Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy [5.985777189633703]
AI copilots represent a new generation of AI-powered systems designed to assist users in complex, context-rich tasks. Central to this personalization is preference optimization: the system's ability to detect, interpret, and align with individual user preferences. This survey examines how user preferences are operationalized in AI copilots.
arXiv Detail & Related papers (2025-05-28T02:52:39Z) - ReCopilot: Reverse Engineering Copilot in Binary Analysis [7.589188903601179]
General-purpose large language models (LLMs) perform well in programming analysis on source code. We present ReCopilot, an expert LLM designed for binary analysis tasks. ReCopilot integrates binary code knowledge through a meticulously constructed dataset.
arXiv Detail & Related papers (2025-05-22T08:21:39Z) - A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making.
With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems.
We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z) - A Comprehensive Evaluation of Four End-to-End AI Autopilots Using CCTest and the Carla Leaderboard [6.229766691427486]
End-to-end AI autopilots for autonomous driving systems have emerged as a promising alternative to traditional modular autopilots.
However, they suffer from the well-known problems of AI systems such as non-determinism, non-explainability, and anomalies.
This paper extends a study of the critical configuration testing approach that has been applied to four open modular autopilots.
arXiv Detail & Related papers (2025-01-21T12:33:32Z) - Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective [77.94874338927492]
OpenAI has claimed that the main technique behind o1 is reinforcement learning.
This paper analyzes the roadmap to achieving o1 from the perspective of reinforcement learning.
arXiv Detail & Related papers (2024-12-18T18:24:47Z) - Coverage-Constrained Human-AI Cooperation with Multiple Experts [21.247853435529446]
We propose the Coverage-constrained Learning to Defer and Complement with Specific Experts (CL2DC) method.
CL2DC makes final decisions through either AI prediction alone or by deferring to or complementing a specific expert.
It achieves superior performance compared to state-of-the-art HAI-CC methods.
arXiv Detail & Related papers (2024-11-18T19:06:01Z) - Combining AI Control Systems and Human Decision Support via Robustness and Criticality [53.10194953873209]
We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks.
We show that the learned AI control system demonstrates robustness against adversarial tampering.
In a training / learning framework, this technology can improve both the AI's decisions and explanations through human interaction.
arXiv Detail & Related papers (2024-07-03T15:38:57Z) - Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairness [53.91018508439669]
The study explores the complexities of integrating Artificial Intelligence into Autonomous Vehicles (AVs).
It examines the challenges introduced by AI components and the impact on testing procedures.
The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology.
arXiv Detail & Related papers (2024-02-21T08:29:42Z) - Healthcare Copilot: Eliciting the Power of General LLMs for Medical Consultation [96.22329536480976]
We introduce the construction of a Healthcare Copilot designed for medical consultation.
The proposed Healthcare Copilot comprises three main components: 1) the Dialogue component, responsible for effective and safe patient interactions; 2) the Memory component, storing both current conversation data and historical patient information; and 3) the Processing component, summarizing the entire dialogue and generating reports.
To evaluate the proposed Healthcare Copilot, we implement an auto-evaluation scheme using ChatGPT for two roles: as a virtual patient engaging in dialogue with the copilot, and as an evaluator to assess the quality of the dialogue.
arXiv Detail & Related papers (2024-02-20T22:26:35Z) - PADTHAI-MM: A Principled Approach for Designing Trustable, Human-centered AI systems using the MAST Methodology [5.38932801848643]
The Multisource AI Scorecard Table (MAST), a checklist rating system, addresses this gap in designing and evaluating AI-enabled decision support systems.
We propose the Principled Approach for Designing Trustable Human-centered AI systems using MAST methodology.
We show that MAST-guided design can improve trust perceptions, and that MAST criteria can be linked to performance, process, and purpose information.
arXiv Detail & Related papers (2024-01-24T23:15:44Z) - Student Mastery or AI Deception? Analyzing ChatGPT's Assessment Proficiency and Evaluating Detection Strategies [1.633179643849375]
Generative AI systems such as ChatGPT have a disruptive effect on learning and assessment.
This work investigates the performance of ChatGPT by evaluating it across three courses.
arXiv Detail & Related papers (2023-11-27T20:10:13Z) - Unity is Strength: Cross-Task Knowledge Distillation to Improve Code Review Generation [0.9208007322096533]
We propose a novel deep-learning architecture, DISCOREV, based on cross-task knowledge distillation.
In our approach, the fine-tuning of the comment generation model is guided by the code refinement model.
Our results show that our approach generates better review comments as measured by the BLEU score.
arXiv Detail & Related papers (2023-09-06T21:10:33Z) - Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z) - Human-Centered AI for Data Science: A Systematic Approach [48.71756559152512]
Human-Centered AI (HCAI) refers to the research effort that aims to design and implement AI techniques to support various human tasks.
We illustrate how we approach HCAI using a series of research projects around Data Science (DS) works as a case study.
arXiv Detail & Related papers (2021-10-03T21:47:13Z) - Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.