Design and evaluation of AI copilots -- case studies of retail copilot templates
- URL: http://arxiv.org/abs/2407.09512v1
- Date: Mon, 17 Jun 2024 17:31:33 GMT
- Title: Design and evaluation of AI copilots -- case studies of retail copilot templates
- Authors: Michal Furmakiewicz, Chang Liu, Angus Taylor, Ilya Venger
- Abstract summary: Building a successful AI copilot requires a systematic approach.
This paper is divided into two sections, covering the design and evaluation of a copilot respectively.
- Score: 2.7274834772504954
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Building a successful AI copilot requires a systematic approach. This paper is divided into two sections, covering the design and evaluation of a copilot, respectively. A case study of developing copilot templates for the retail domain by Microsoft is used to illustrate the role and importance of each aspect. The first section explores the key technical components of a copilot's architecture, including the LLM, plugins for knowledge retrieval and actions, orchestration, system prompts, and responsible AI guardrails. The second section discusses testing and evaluation as a principled way to promote desired outcomes and manage unintended consequences when using AI in a business context. We discuss how to measure and improve a copilot's quality and safety through the lens of an end-to-end human-AI decision loop framework. By providing insights into the anatomy of a copilot and the critical aspects of testing and evaluation, this paper provides concrete evidence that good design and evaluation practices are essential for building effective, human-centered AI assistants.
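The architecture the abstract names (LLM, retrieval and action plugins, orchestration, system prompt, responsible AI guardrails) can be pictured with a minimal Python sketch. All names below (`Plugin`, `guardrail_check`, `call_llm`, `orchestrate`) are hypothetical illustrations of those components, not the paper's or any Microsoft template's actual API, and the LLM call is stubbed out.

```python
# Minimal sketch of the copilot anatomy described in the abstract:
# system prompt, plugins for retrieval/actions, orchestration, and
# responsible-AI guardrails. All names are hypothetical illustrations.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Plugin:
    """A knowledge-retrieval or action plugin the orchestrator can call."""
    name: str
    description: str
    run: Callable[[str], str]


SYSTEM_PROMPT = (
    "You are a retail copilot. Ground answers in retrieved data, "
    "cite the plugin used, and refuse requests outside retail scenarios."
)


def guardrail_check(text: str, blocked_terms: List[str]) -> bool:
    """Toy responsible-AI filter: reject text containing blocked terms."""
    lowered = text.lower()
    return not any(term in lowered for term in blocked_terms)


def call_llm(system_prompt: str, user_query: str, context: str) -> str:
    """Stand-in for an LLM call; a real copilot would invoke a hosted model here."""
    return f"[answer to '{user_query}' grounded in: {context or 'no context'}]"


def orchestrate(query: str, plugins: Dict[str, Plugin]) -> str:
    """One turn of the decision loop: input guardrail -> plugin -> LLM -> output guardrail."""
    if not guardrail_check(query, blocked_terms=["credit card number"]):
        return "Request declined by input guardrail."

    # Naive plugin selection by keyword; production systems let the LLM plan this.
    context = ""
    for plugin in plugins.values():
        if plugin.name in query.lower():
            context = plugin.run(query)
            break

    answer = call_llm(SYSTEM_PROMPT, query, context)
    if not guardrail_check(answer, blocked_terms=["credit card number"]):
        return "Response withheld by output guardrail."
    return answer


if __name__ == "__main__":
    plugins = {
        "inventory": Plugin(
            name="inventory",
            description="Looks up stock levels for a product.",
            run=lambda q: "inventory: 42 units of SKU-123 in the Seattle store",
        ),
    }
    print(orchestrate("How much inventory do we have for SKU-123?", plugins))
```

Running the sketch routes the query through the input guardrail, a keyword-matched inventory plugin, the stubbed LLM, and the output guardrail, mirroring the single-turn decision loop the abstract outlines; a production copilot would replace the keyword routing with LLM-driven planning and the stub with a hosted model call.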
Related papers
- A Human Centric Requirements Engineering Framework for Assessing Github Copilot Output [0.0]
GitHub Copilot introduces new challenges in how such software tools address human needs. I analyzed GitHub Copilot's interaction with users through its chat interface. I established a human-centered requirements framework with clear metrics to evaluate these qualities.
arXiv Detail & Related papers (2025-08-05T21:33:23Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration [79.69935257008467]
We introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities. We conduct the first large-scale human study (N=118) explicitly designed to measure it. In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations' influence on human understanding.
arXiv Detail & Related papers (2025-06-05T20:48:16Z) - From Coders to Critics: Empowering Students through Peer Assessment in the Age of AI Copilots [3.3094795918443634]
This paper presents an empirical study of a rubric-based, anonymized peer review process implemented in a large programming course. Students evaluated each other's final projects (a 2D game), and their assessments were compared to instructor grades using correlation, mean absolute error, and root mean square error (RMSE). Results show that peer review can approximate instructor evaluation with moderate accuracy and foster student engagement, evaluative thinking, and interest in providing good feedback to their peers.
arXiv Detail & Related papers (2025-05-28T08:17:05Z) - Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy [5.985777189633703]
AI copilots represent a new generation of AI-powered systems designed to assist users in complex, context-rich tasks. Central to this personalization is preference optimization: the system's ability to detect, interpret, and align with individual user preferences. This survey examines how user preferences are operationalized in AI copilots.
arXiv Detail & Related papers (2025-05-28T02:52:39Z) - ReCopilot: Reverse Engineering Copilot in Binary Analysis [7.589188903601179]
General-purpose large language models (LLMs) perform well in programming analysis on source code. We present ReCopilot, an expert LLM designed for binary analysis tasks. ReCopilot integrates binary code knowledge through a meticulously constructed dataset.
arXiv Detail & Related papers (2025-05-22T08:21:39Z) - A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems [93.8285345915925]
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making.
With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems.
We categorize existing methods along two dimensions: (1) Regimes, which define the stage at which reasoning is achieved; and (2) Architectures, which determine the components involved in the reasoning process.
arXiv Detail & Related papers (2025-04-12T01:27:49Z) - A Comprehensive Evaluation of Four End-to-End AI Autopilots Using CCTest and the Carla Leaderboard [6.229766691427486]
End-to-end AI autopilots for autonomous driving systems have emerged as a promising alternative to traditional modular autopilots.
However, they suffer from the well-known problems of AI systems such as non-determinism, non-explainability, and anomalies.
This paper extends a study of the critical configuration testing approach that has been applied to four open modular autopilots.
arXiv Detail & Related papers (2025-01-21T12:33:32Z) - Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective [77.94874338927492]
OpenAI has claimed that the main technique behind o1 is reinforcement learning.
This paper analyzes the roadmap to achieving o1 from the perspective of reinforcement learning.
arXiv Detail & Related papers (2024-12-18T18:24:47Z) - Coverage-Constrained Human-AI Cooperation with Multiple Experts [21.247853435529446]
We propose the Coverage-constrained Learning to Defer and Complement with Specific Experts (CL2DC) method.
CL2DC makes final decisions through either AI prediction alone or by deferring to or complementing a specific expert.
It achieves superior performance compared to state-of-the-art HAI-CC methods.
arXiv Detail & Related papers (2024-11-18T19:06:01Z) - Combining AI Control Systems and Human Decision Support via Robustness and Criticality [53.10194953873209]
We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks.
We show that the learned AI control system demonstrates robustness against adversarial tampering.
In a training / learning framework, this technology can improve both the AI's decisions and explanations through human interaction.
arXiv Detail & Related papers (2024-07-03T15:38:57Z) - Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairness [53.91018508439669]
The study explores the complexities of integrating Artificial Intelligence into Autonomous Vehicles (AVs).
It examines the challenges introduced by AI components and the impact on testing procedures.
The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology.
arXiv Detail & Related papers (2024-02-21T08:29:42Z) - Healthcare Copilot: Eliciting the Power of General LLMs for Medical Consultation [96.22329536480976]
We introduce the construction of a Healthcare Copilot designed for medical consultation.
The proposed Healthcare Copilot comprises three main components: 1) the Dialogue component, responsible for effective and safe patient interactions; 2) the Memory component, storing both current conversation data and historical patient information; and 3) the Processing component, summarizing the entire dialogue and generating reports.
To evaluate the proposed Healthcare Copilot, we implement an auto-evaluation scheme using ChatGPT for two roles: as a virtual patient engaging in dialogue with the copilot, and as an evaluator to assess the quality of the dialogue.
arXiv Detail & Related papers (2024-02-20T22:26:35Z) - PADTHAI-MM: A Principled Approach for Designing Trustable, Human-centered AI systems using the MAST Methodology [5.38932801848643]
The Multisource AI Scorecard Table (MAST), a checklist rating system, addresses this gap in designing and evaluating AI-enabled decision support systems.
We propose the Principled Approach for Designing Trustable Human-centered AI systems using MAST methodology.
We show that MAST-guided design can improve trust perceptions, and that MAST criteria can be linked to performance, process, and purpose information.
arXiv Detail & Related papers (2024-01-24T23:15:44Z) - Student Mastery or AI Deception? Analyzing ChatGPT's Assessment Proficiency and Evaluating Detection Strategies [1.633179643849375]
Generative AI systems such as ChatGPT have a disruptive effect on learning and assessment.
This work investigates the performance of ChatGPT by evaluating it across three courses.
arXiv Detail & Related papers (2023-11-27T20:10:13Z) - Unity is Strength: Cross-Task Knowledge Distillation to Improve Code Review Generation [0.9208007322096533]
We propose a novel deep-learning architecture, DISCOREV, based on cross-task knowledge distillation.
In our approach, the fine-tuning of the comment generation model is guided by the code refinement model.
Our results show that our approach generates better review comments as measured by the BLEU score.
arXiv Detail & Related papers (2023-09-06T21:10:33Z) - Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task [50.72283841720014]
We propose a novel learning strategy that can improve reasoning about the effects of actions.
We demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
arXiv Detail & Related papers (2022-12-07T05:41:58Z) - Human-Centered AI for Data Science: A Systematic Approach [48.71756559152512]
Human-Centered AI (HCAI) refers to the research effort that aims to design and implement AI techniques to support various human tasks.
We illustrate how we approach HCAI using a series of research projects around Data Science (DS) works as a case study.
arXiv Detail & Related papers (2021-10-03T21:47:13Z) - Leveraging Expert Consistency to Improve Algorithmic Decision Support [62.61153549123407]
We explore the use of historical expert decisions as a rich source of information that can be combined with observed outcomes to narrow the construct gap.
We propose an influence function-based methodology to estimate expert consistency indirectly when each case in the data is assessed by a single expert.
Our empirical evaluation, using simulations in a clinical setting and real-world data from the child welfare domain, indicates that the proposed approach successfully narrows the construct gap.
arXiv Detail & Related papers (2021-01-24T05:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.