Related papers: Human-centred test and evaluation of military AI

URL: http://arxiv.org/abs/2412.01978v1
Date: Mon, 02 Dec 2024 21:14:55 GMT
Title: Human-centred test and evaluation of military AI
Authors: David Helmer, Michael Boardman, S. Kate Conroy, Adam J. Hepworth, Manoj Harjani,
Abstract summary: The REAIM 2024 Blueprint for Action states that AI applications in the military domain should be ethical and human-centric. TEVV in the development and deployment of AI systems needs to involve human users throughout the lifecycle. Traditional human-centred test and evaluation methods from human factors need to be adapted for deployed AI systems.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The REAIM 2024 Blueprint for Action states that AI applications in the military domain should be ethical and human-centric and that humans must remain responsible and accountable for their use and effects. Developing rigorous test and evaluation, verification and validation (TEVV) frameworks will contribute to robust oversight mechanisms. TEVV in the development and deployment of AI systems needs to involve human users throughout the lifecycle. Traditional human-centred test and evaluation methods from human factors need to be adapted for deployed AI systems that require ongoing monitoring and evaluation. The language around AI-enabled systems should be shifted to inclusion of the human(s) as a component of the system. Standards and requirements supporting this adjusted definition are needed, as are metrics and means to evaluate them. The need for dialogue between technologists and policymakers on human-centred TEVV will be evergreen, but dialogue needs to be initiated with an objective in mind for it to be productive. Development of TEVV throughout the system lifecycle is critical to support this evolution, including the issue of human scalability and impact on scale of achievable testing. Communication between technical and non-technical communities must be improved to ensure operators and policy-makers understand risk assumed by system use and to better inform research and development. Test and evaluation in support of responsible AI deployment must include the effect of the human to reflect operationally realised system performance. Means of communicating the results of TEVV to those using and making decisions regarding the use of AI-based systems will be key in informing risk-based decisions regarding use.

Related papers

Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues [16.07828032939124]
This paper presents an evaluation framework for agentic AI systems in mission-critical negotiation contexts. Using Sotopia as a simulation testbed, we present two experiments that systematically evaluated how personality traits and AI agent characteristics influence social negotiation outcomes.
arXiv Detail & Related papers (2025-06-19T00:14:56Z)
Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism [48.41735416075536]
Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions. We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations.
arXiv Detail & Related papers (2025-06-10T18:43:26Z)
The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
Towards User-Centred Design of AI-Assisted Decision-Making in Law Enforcement [1.1890528509539204]
User requirements for AI-assisted systems in law enforcement remain unclear. Participants in our study highlighted the need for a system capable of processing and analysing large volumes of data efficiently. We argue that it is very unlikely that the system will ever achieve full automation due to the dynamic and complex nature of the law enforcement domain.
arXiv Detail & Related papers (2025-04-24T09:25:29Z)
On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks.
arXiv Detail & Related papers (2025-02-27T20:21:36Z)
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
To Err Is AI! Debugging as an Intervention to Facilitate Appropriate Reliance on AI Systems [11.690126756498223]
Vision for optimal human-AI collaboration requires 'appropriate reliance' of humans on AI systems. In practice, the performance disparity of machine learning models on out-of-distribution data makes dataset-specific performance feedback unreliable.
arXiv Detail & Related papers (2024-09-22T09:43:27Z)
Combining AI Control Systems and Human Decision Support via Robustness and Criticality [53.10194953873209]
We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks. We show that the learned AI control system demonstrates robustness against adversarial tampering. In a training / learning framework, this technology can improve both the AI's decisions and explanations through human interaction.
arXiv Detail & Related papers (2024-07-03T15:38:57Z)
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z)
An AI System Evaluation Framework for Advancing AI Safety: Terminology, Taxonomy, Lifecycle Mapping [23.92695048003188]
This paper proposes a framework for AI system evaluation comprising three components. This framework catalyses a deeper discourse on AI system evaluation beyond model-centric approaches.
arXiv Detail & Related papers (2024-04-08T10:49:59Z)
Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairness [53.91018508439669]
The study explores the complexities of integrating Artificial Intelligence into Autonomous Vehicles (AVs). It examines the challenges introduced by AI components and the impact on testing procedures. The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology.
arXiv Detail & Related papers (2024-02-21T08:29:42Z)
Requirements for Explainability and Acceptance of Artificial Intelligence in Collaborative Work [0.0]
The present structured literature analysis examines the requirements for the explainability and acceptance of AI. Results indicate that the two main groups of users are developers who require information about the internal operations of the model. The acceptance of AI systems depends on information about the system's functions and performance, privacy and ethical considerations.
arXiv Detail & Related papers (2023-06-27T11:36:07Z)
Model evaluation for extreme risks [46.53170857607407]
Further progress in AI development could lead to capabilities that pose extreme risks, such as offensive cyber capabilities or strong manipulation skills. We explain why model evaluation is critical for addressing extreme risks.
arXiv Detail & Related papers (2023-05-24T16:38:43Z)
Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations. It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)
AAAI FSS-19: Human-Centered AI: Trustworthiness of AI Models and Data Proceedings [8.445274192818825]
It is crucial for predictive models to be uncertainty-aware and yield trustworthy predictions. The focus of this symposium was on AI systems to improve data quality and technical robustness and safety. Submissions from broadly defined areas also discussed approaches addressing requirements such as explainable models, human trust and ethical aspects of AI.
arXiv Detail & Related papers (2020-01-15T15:30:29Z)
