SLEEPYLAND: trust begins with fair evaluation of automatic sleep staging models
- URL: http://arxiv.org/abs/2506.08574v2
- Date: Wed, 11 Jun 2025 13:12:29 GMT
- Title: SLEEPYLAND: trust begins with fair evaluation of automatic sleep staging models
- Authors: Alvise Dei Rossi, Matteo Metaldi, Michal Bechny, Irina Filchenko, Julia van der Meer, Markus H. Schmidt, Claudio L. A. Bassetti, Athina Tzovara, Francesca D. Faraci, Luigi Fiorillo
- Abstract summary: We present SLEEPYLAND, an open-source sleep staging evaluation framework. It includes more than 220'000 hours of in-domain (ID) sleep recordings and more than 84'000 hours of out-of-domain (OOD) sleep recordings. We introduce SOMNUS, an ensemble combining models across architectures and channel setups via soft voting.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite advances in deep learning for automatic sleep staging, clinical adoption remains limited due to challenges in fair model evaluation, generalization across diverse datasets, model bias, and variability in human annotations. We present SLEEPYLAND, an open-source sleep staging evaluation framework designed to address these barriers. It includes more than 220'000 hours in-domain (ID) sleep recordings, and more than 84'000 hours out-of-domain (OOD) sleep recordings, spanning a broad range of ages, sleep-wake disorders, and hardware setups. We release pre-trained models based on high-performing SoA architectures and evaluate them under standardized conditions across single- and multi-channel EEG/EOG configurations. We introduce SOMNUS, an ensemble combining models across architectures and channel setups via soft voting. SOMNUS achieves robust performance across twenty-four different datasets, with macro-F1 scores between 68.7% and 87.2%, outperforming individual models in 94.9% of cases. Notably, SOMNUS surpasses previous SoA methods, even including cases where compared models were trained ID while SOMNUS treated the same data as OOD. Using a subset of the BSWR (N=6'633), we quantify model biases linked to age, gender, AHI, and PLMI, showing that while ensemble improves robustness, no model architecture consistently minimizes bias in performance and clinical markers estimation. In evaluations on OOD multi-annotated datasets (DOD-H, DOD-O), SOMNUS exceeds the best human scorer, i.e., MF1 85.2% vs 80.8% on DOD-H, and 80.2% vs 75.9% on DOD-O, better reproducing the scorer consensus than any individual expert (k = 0.89/0.85 and ACS = 0.95/0.94 for healthy/OSA cohorts). Finally, we introduce ensemble disagreement metrics - entropy and inter-model divergence based - predicting regions of scorer disagreement with ROC AUCs up to 0.828, offering a data-driven proxy for human uncertainty.
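The abstract names two mechanisms, soft voting across models and an entropy-based disagreement metric, without giving their exact definitions here. The following is a minimal sketch of the general idea only, not the authors' implementation: it assumes each model outputs a probability vector over five sleep stages per 30-second epoch, averages those vectors (soft voting), and uses the Shannon entropy of the averaged vector as an uncertainty score. The stage order, function names, and example probabilities are all hypothetical.

```python
import math

# Assumed label order for the five sleep stages (hypothetical).
STAGES = ["W", "N1", "N2", "N3", "REM"]


def soft_vote(model_probs):
    """Average per-class probabilities across models for one epoch."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def entropy(probs):
    """Shannon entropy in bits; terms with zero probability contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def predict_epoch(model_probs):
    """Return (predicted stage, entropy of the averaged distribution)."""
    avg = soft_vote(model_probs)
    best = max(range(len(avg)), key=avg.__getitem__)
    return STAGES[best], entropy(avg)


# Hypothetical outputs of three models for two epochs.
agree = [
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.02, 0.03, 0.90, 0.03, 0.02],
    [0.05, 0.10, 0.75, 0.05, 0.05],
]
split = [
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.05, 0.20, 0.65, 0.05, 0.05],
    [0.50, 0.30, 0.10, 0.05, 0.05],
]

if __name__ == "__main__":
    for name, epoch in (("agree", agree), ("split", split)):
        stage, h = predict_epoch(epoch)
        print(f"{name}: stage={stage}, entropy={h:.3f} bits")
```

In this toy example the epoch on which the models disagree receives a higher entropy than the one on which they agree, which is the property that would let such a score serve as a proxy for scorer uncertainty.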
Related papers
- Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout [62.73150122809138]
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices. We propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD). The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and cost (up to 15.0% smaller).
arXiv Detail & Related papers (2025-07-14T16:19:00Z)
- WorldPM: Scaling Human Preference Modeling [130.23230492612214]
We propose World Preference Modeling (WorldPM) to emphasize this scaling potential. We collect preference data from public forums covering diverse user communities. We conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters.
arXiv Detail & Related papers (2025-05-15T17:38:37Z)
- Segment-and-Classify: ROI-Guided Generalizable Contrast Phase Classification in CT Using XGBoost [7.689389068258514]
This study utilized three public CT datasets from separate institutions. The phase prediction model was trained on the WAW-TACE dataset and validated on the VinDr-Multiphase and C4KC-KiTS datasets.
arXiv Detail & Related papers (2025-01-23T20:01:33Z)
- Early Diagnosis of Alzheimer's Diseases and Dementia from MRI Images Using an Ensemble Deep Learning [0.7510165488300369]
Alzheimer's Disease (AD) is a progressive neurological disorder that can result in significant cognitive impairment and dementia. In this study, we proposed two CNNs, IR-BRAINNET and Modified-DEMNET, designed to detect the early stages of AD accurately. We also introduced an ensemble model that averages their outputs to reduce variance across the CNNs and enhance AD detection.
arXiv Detail & Related papers (2024-12-07T14:27:41Z)
- An AI-enabled Bias-Free Respiratory Disease Diagnosis Model using Cough Audio: A Case Study for COVID-19 [1.1146119513912156]
We propose the Bias Free Network (RBFNet) to mitigate the impact of confounders in the training data distribution.
RBFNet ensures accurate and unbiased respiratory disease (RD) diagnosis features, emphasizing its relevance by incorporating a COVID-19 dataset.
An additional bias predictor is incorporated in the classification scheme to formulate a conditional Generative Adversarial Network (cGAN).
arXiv Detail & Related papers (2024-01-04T13:09:45Z)
- On the explainability of hospitalization prediction on a large COVID-19 patient dataset [45.82374977939355]
We develop various AI models to predict hospitalization on a large (over 110k) cohort of COVID-19 positive-tested US patients.
Despite high data imbalance, the models reach average precision 0.96-0.98 (0.75-0.85), recall 0.96-0.98 (0.74-0.85), and F-score 0.97-0.98 (0.79-0.83) on the non-hospitalized (or hospitalized) class.
arXiv Detail & Related papers (2021-10-28T10:23:38Z)
- Pediatric Automatic Sleep Staging: A comparative study of state-of-the-art deep learning methods [16.651453507701966]
We conduct a large-scale comparative study on the state-of-the-art deep learning methods for pediatric automatic sleep staging.
A selection of six different deep neural networks with diverging features are adopted to evaluate a sample of more than 1,200 children.
Experiments show that the performance of automated pediatric sleep staging when evaluated on new subjects is equivalent to the expert-level one reported on adults.
arXiv Detail & Related papers (2021-08-23T15:39:48Z)
- Sleep Staging Based on Serialized Dual Attention Network [0.0]
We propose a deep learning model SDAN based on raw EEG.
It serially combines the channel attention and spatial attention mechanisms to filter and highlight key information.
It achieves excellent results in the N1 sleep stage compared to other methods.
arXiv Detail & Related papers (2021-07-18T13:18:12Z)
- Convolutional Neural Networks for Sleep Stage Scoring on a Two-Channel EEG Signal [63.18666008322476]
Sleep problems are among the major health issues worldwide.
The basic tool used by specialists is the polysomnogram, a collection of different signals recorded during sleep.
Specialists have to score the different signals according to one of the standard guidelines.
arXiv Detail & Related papers (2021-03-30T09:59:56Z)
- MSED: a multi-modal sleep event detection model for clinical sleep analysis [62.997667081978825]
We designed a single deep neural network architecture to jointly detect sleep events in a polysomnogram.
The performance of the model was quantified by F1, precision, and recall scores, and by correlating index values to clinical values.
arXiv Detail & Related papers (2021-01-07T13:08:44Z)
- Automatic sleep stage classification with deep residual networks in a mixed-cohort setting [63.52264764099532]
We developed a novel deep neural network model to assess the generalizability of several large-scale cohorts.
Overall classification accuracy improved with increasing fractions of training data.
arXiv Detail & Related papers (2020-08-21T10:48:35Z)
- Personalized Automatic Sleep Staging with Single-Night Data: a Pilot Study with KL-Divergence Regularization [18.754100926147903]
We propose a Kullback-Leibler (KL) divergence regularized transfer learning approach to address this problem.
We employ the pretrained SeqSleepNet as a starting point and finetune it with the single-night personalization data to derive the personalized model.
Experimental results on the Sleep-EDF Expanded database with 75 subjects show that sleep staging personalization with single-night data is possible with the help of the proposed KL-divergence regularization.
arXiv Detail & Related papers (2020-04-23T17:48:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.