Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection
- URL: http://arxiv.org/abs/2508.04899v1
- Date: Wed, 06 Aug 2025 21:55:28 GMT
- Title: Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection
- Authors: Jovana Kljajic, John M. O'Toole, Robert Hogan, Tamara Skoric,
- Abstract summary: Current practices often rely on inconsistent and biased metrics. Expert-level claims about AI performance are frequently made without rigorous validation. This study proposes best practices tailored to the specific challenges of neonatal seizure detection.
- Score: 1.4624458429745086
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reliable evaluation of machine learning models for neonatal seizure detection is critical for clinical adoption. Current practices often rely on inconsistent and biased metrics, hindering model comparability and interpretability. Expert-level claims about AI performance are frequently made without rigorous validation, raising concerns about their reliability. This study aims to systematically evaluate common performance metrics and propose best practices tailored to the specific challenges of neonatal seizure detection. Using real and synthetic seizure annotations, we assessed standard performance metrics, consensus strategies, and human-expert-level equivalence tests under varying class imbalance, inter-rater agreement, and numbers of raters. The Matthews and Pearson correlation coefficients outperformed the area under the curve in reflecting performance under class imbalance. Consensus types are sensitive to the number of raters and the level of agreement among them. Among human-expert-level equivalence tests, the multi-rater Turing test using Fleiss' kappa best captured expert-level AI performance. We recommend reporting: (1) at least one balanced metric; (2) sensitivity, specificity, PPV, and NPV; (3) multi-rater Turing test results using Fleiss' kappa; and (4) all of the above on a held-out validation set. The proposed framework provides an important prerequisite to clinical validation by enabling a thorough and honest appraisal of AI methods for neonatal seizure detection.
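The balanced and per-class metrics recommended above, together with Fleiss' kappa for the multi-rater Turing test, are standard quantities. The sketch below shows one way to compute them on per-epoch binary seizure labels. It is a minimal illustration on toy data, not the authors' implementation; the epoch length, the rater panel, and the way the AI is swapped into the panel are assumptions.

```python
# Minimal sketch: MCC, sensitivity, specificity, PPV, NPV, and Fleiss' kappa
# on per-epoch binary (0/1) seizure labels. Toy data; not the authors' code.
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return TP, FP, TN, FN for binary per-epoch labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fn = np.sum(y_true & ~y_pred)
    return tp, fp, tn, fn

def summary_metrics(y_true, y_pred):
    """Balanced (MCC) and per-class (sens/spec/PPV/NPV) metrics."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sens = tp / (tp + fn) if (tp + fn) else np.nan
    spec = tn / (tn + fp) if (tn + fp) else np.nan
    ppv = tp / (tp + fp) if (tp + fp) else np.nan
    npv = tn / (tn + fn) if (tn + fn) else np.nan
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else np.nan
    return {"MCC": mcc, "sensitivity": sens, "specificity": spec,
            "PPV": ppv, "NPV": npv}

def fleiss_kappa(ratings):
    """Fleiss' kappa for an (n_epochs, n_raters) array of 0/1 labels."""
    ratings = np.asarray(ratings, dtype=int)
    n_items, n_raters = ratings.shape
    # Counts per item for each category (0 = no seizure, 1 = seizure).
    counts = np.stack([(ratings == c).sum(axis=1) for c in (0, 1)], axis=1)
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    p_e = np.sum((counts.sum(axis=0) / (n_items * n_raters)) ** 2)
    return (p_bar - p_e) / (1 - p_e)

# Example: 3 human raters plus the AI on 1000 epochs (hypothetical toy data).
rng = np.random.default_rng(0)
truth = rng.random(1000) < 0.1          # ~10% seizure burden (class imbalance)
humans = np.stack([truth ^ (rng.random(1000) < 0.05) for _ in range(3)], axis=1)
ai = truth ^ (rng.random(1000) < 0.05)

print(summary_metrics(humans[:, 0], ai))
# Multi-rater Turing-test idea: compare agreement of the human-only panel with
# agreement of a panel in which the AI replaces one human rater.
print("humans only:", fleiss_kappa(humans))
print("AI swapped in:", fleiss_kappa(np.column_stack([humans[:, 1:], ai])))
```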
Related papers
- Cohort-attention Evaluation Metric against Tied Data: Studying Performance of Classification Models in Cancer Detection [1.3767986497772466]
We propose the Cohort-Attention Evaluation Metrics (CAT) framework to address these challenges. CAT introduces patient-level assessment, entropy-based distribution weighting, and cohort-weighted sensitivity and specificity. This approach enhances predictive reliability, fairness, and interpretability, providing a robust evaluation method for AI-driven medical screening models.
arXiv Detail & Related papers (2025-03-17T02:50:40Z)
- Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles [5.634825161148484]
We introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics.
The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing.
We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs.
arXiv Detail & Related papers (2023-11-14T10:16:05Z)
- On Pixel-level Performance Assessment in Anomaly Detection [87.7131059062292]
Anomaly detection methods have demonstrated remarkable success across various applications.
However, assessing their performance, particularly at the pixel-level, presents a complex challenge.
In this paper, we dissect the intricacies of this challenge, underscored by visual evidence and statistical analysis.
arXiv Detail & Related papers (2023-10-25T08:02:27Z)
- Testing the Consistency of Performance Scores Reported for Binary Classification Problems [0.0]
We introduce numerical techniques to assess the consistency of reported performance scores and the assumed experimental setup.
We demonstrate how the proposed techniques can effectively detect inconsistencies, thereby safeguarding the integrity of research fields.
To benefit the scientific community, we have made the consistency tests available in an open-source Python package.
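As one illustration of the consistency-testing idea, the sketch below checks whether reported sensitivity and specificity are reachable by any integer confusion matrix given the class sizes. It is a simplified stand-in, assuming rounded reporting to a fixed number of decimals, and is not the API of the authors' open-source package.

```python
# Hedged sketch of score-consistency checking: for a binary test set with known
# class sizes, can any integer TP and TN reproduce the reported (rounded)
# sensitivity and specificity? Simplified illustration only.
def scores_are_consistent(n_pos, n_neg, sens, spec, decimals=4):
    tol = 0.5 * 10 ** (-decimals)          # rounding tolerance of the report
    tp_ok = any(abs(tp / n_pos - sens) <= tol for tp in range(n_pos + 1))
    tn_ok = any(abs(tn / n_neg - spec) <= tol for tn in range(n_neg + 1))
    return tp_ok and tn_ok

# Sensitivity 0.8571 is achievable with 35 positives (30/35), so this passes:
print(scores_are_consistent(35, 100, 0.8571, 0.9000))   # True
# Sensitivity 0.8600 cannot be hit with 35 positives, so this flags an issue:
print(scores_are_consistent(35, 100, 0.8600, 0.9000))   # False
```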
arXiv Detail & Related papers (2023-10-19T07:04:29Z)
- TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed to speed up patient recruitment by automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
- Evaluating AI systems under uncertain ground truth: a case study in dermatology [43.8328264420381]
We show that ignoring uncertainty leads to overly optimistic estimates of model performance. In skin condition classification, we find that a large portion of the dataset exhibits significant ground truth uncertainty.
arXiv Detail & Related papers (2023-07-05T10:33:45Z)
- On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework, which induces different responses to UAPs between normal and adversarial samples.
Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Abnormal-aware Multi-person Evaluation System with Improved Fuzzy Weighting [0.0]
We adopt a two-stage screening method consisting of rough screening followed by a score-weighted Kendall-$\tau$ distance.
We use the Fuzzy Synthetic Evaluation (FSE) method to determine the significance of the scores given by reviewers as well as their reliability.
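For context, the unweighted Kendall-$\tau$ distance simply counts pairs of items ordered differently by two rankings; the score-weighted variant used in that paper builds on this base quantity. The sketch below shows only the plain, unweighted distance and is not the authors' implementation.

```python
# Plain (unweighted) Kendall-tau distance between two rankings: the number of
# item pairs ordered differently by the two rankings. Illustration only.
from itertools import combinations

def kendall_tau_distance(rank_a, rank_b):
    """rank_a[i] and rank_b[i] are the positions assigned to item i."""
    assert len(rank_a) == len(rank_b)
    discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        if (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
            discordant += 1
    return discordant

# Two reviewers ranking the same four submissions:
print(kendall_tau_distance([1, 2, 3, 4], [1, 3, 2, 4]))  # -> 1 (one swapped pair)
```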
arXiv Detail & Related papers (2022-05-01T03:42:43Z)
- Estimating and Improving Fairness with Adversarial Learning [65.99330614802388]
We propose an adversarial multi-task training strategy to simultaneously mitigate and detect bias in the deep learning-based medical image analysis system.
Specifically, we propose to add a discrimination module against bias and a critical module that predicts unfairness within the base classification model.
We evaluate our framework on a large-scale, publicly available skin lesion dataset.
arXiv Detail & Related papers (2021-03-07T03:10:32Z)
- Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.