Beyond Accuracy: Automated De-Identification of Large Real-World
Clinical Text Datasets
- URL: http://arxiv.org/abs/2312.08495v1
- Date: Wed, 13 Dec 2023 20:15:29 GMT
- Title: Beyond Accuracy: Automated De-Identification of Large Real-World
Clinical Text Datasets
- Authors: Veysel Kocaman, Hasham Ul Haq, David Talby
- Abstract summary: This paper summarizes lessons learned from building a system used to de-identify over one billion real clinical notes.
A fully automated solution requires a very high level of accuracy that does not require manual review.
- Score: 7.6631083158336715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent research advances achieve human-level accuracy for de-identifying
free-text clinical notes on research datasets, but gaps remain in reproducing
this in large real-world settings. This paper summarizes lessons learned from
building a system used to de-identify over one billion real clinical notes, in
a fully automated way, that was independently certified by multiple
organizations for production use. A fully automated solution requires a very
high level of accuracy that does not require manual review. A hybrid
context-based model architecture is described, which outperforms a Named Entity
Recognition (NER)-only model by 10% on the i2b2-2014 benchmark. The proposed
system makes 50%, 475%, and 575% fewer errors than the comparable AWS, Azure,
and GCP services respectively while also outperforming ChatGPT by 33%. It
exceeds 98% coverage of sensitive data across 7 European languages, without a
need for fine-tuning. A second set of described models enables data obfuscation
-- replacing sensitive data with random surrogates -- while retaining name,
date, gender, clinical, and format consistency. Both the practical need and the
solution architecture that provides for reliable & linked anonymized documents
are described.
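
The obfuscation idea in the abstract (random surrogates that stay consistent across linked documents) can be illustrated with a minimal sketch. This is not the authors' implementation: the surrogate pool, the salted-hash mapping, the per-patient date shift, and the `(start, end, label)` span format are all assumptions made for illustration, and gender and clinical consistency are not handled here.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical surrogate pool; a real system would use a much larger one.
SURROGATE_NAMES = ["Alex Morgan", "Jordan Lee", "Sam Rivera", "Casey Kim"]


def _stable_int(key: str, salt: str) -> int:
    """Map a key to a repeatable integer using a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + "|" + key).encode("utf-8")).hexdigest()
    return int(digest, 16)


def surrogate_name(name: str, salt: str) -> str:
    """Same input name (case-insensitive) always yields the same surrogate."""
    return SURROGATE_NAMES[_stable_int(name.lower(), salt) % len(SURROGATE_NAMES)]


def shift_date(date_text: str, patient_id: str, salt: str, fmt: str = "%Y-%m-%d") -> str:
    """Shift a date by a fixed per-patient offset of 1 to 60 days.

    Intervals between a patient's dates are preserved, and the output
    keeps the input format.
    """
    offset_days = _stable_int(patient_id, salt) % 60 + 1
    shifted = datetime.strptime(date_text, fmt) + timedelta(days=offset_days)
    return shifted.strftime(fmt)


def obfuscate(text: str, entities: list, patient_id: str, salt: str) -> str:
    """Replace detected spans with surrogates.

    `entities` is a list of (start, end, label) tuples, assumed to come
    from an upstream detection step that is not shown here.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])
        value = text[start:end]
        if label == "NAME":
            pieces.append(surrogate_name(value, salt))
        elif label == "DATE":
            pieces.append(shift_date(value, patient_id, salt))
        else:
            # Fall back to masking for labels without a surrogate strategy.
            pieces.append("<" + label + ">")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Because both mappings depend only on the input value, the patient identifier, and the salt, two notes about the same patient receive the same surrogate name and the same date offset, which is one simple way to keep obfuscated documents linked.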
Related papers
- Can Reasoning LLMs Enhance Clinical Document Classification? [7.026393789313748]
Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task.
This study evaluates the performance and consistency of eight LLMs: four reasoning (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat).
Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%).
arXiv Detail & Related papers (2025-04-10T18:00:27Z)
- Uncertainty-aware Long-tailed Weights Model the Utility of Pseudo-labels for Semi-supervised Learning [50.868594148443215]
We propose an Uncertainty-aware Ensemble Structure (UES) to assess the utility of pseudo-labels for unlabeled samples.
UES is lightweight and architecture-agnostic, easily extending to various computer vision tasks, including classification and regression.
arXiv Detail & Related papers (2025-03-13T02:21:04Z)
- Embedding-Driven Diversity Sampling to Improve Few-Shot Synthetic Data Generation [4.684310901243605]
We propose an embedding-driven approach that uses diversity sampling from a small set of real clinical notes to guide large language models in few-shot prompting.
Using cosine similarity and a Turing test, our approach produced synthetic notes that more closely align with real clinical text.
arXiv Detail & Related papers (2025-01-20T00:16:57Z)
- Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection [6.454528834218153]
RIOLU is fully automated, automatically parameterized, and does not need labeled samples.
RIOLU can generate precise patterns from datasets in various domains, with a high F1 score of 97.2%.
A variant of RIOLU, with user guidance, can further boost its precision, with up to 37.4% improvement in terms of F1.
arXiv Detail & Related papers (2024-12-06T18:18:26Z)
- Improving the accuracy of automated labeling of specimen images datasets via a confidence-based process [9.0255922670433]
We present and validate an approach that can greatly improve automatic labeling accuracy.
We demonstrate that a naive model that produced 86% initial accuracy can achieve improved performance.
After validating the approach in a number of ways, we annotate a large dataset of over 600,000 herbarium specimens.
arXiv Detail & Related papers (2024-11-15T09:39:12Z)
- SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation [55.87169702896249]
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift.
We propose a framework to evaluate DA methods and present a fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment.
Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications.
arXiv Detail & Related papers (2024-07-16T12:52:29Z)
- PeFAD: A Parameter-Efficient Federated Framework for Time Series Anomaly Detection [51.20479454379662]
In response to increasing privacy concerns, we propose a Parameter-efficient Federated Anomaly Detection framework named PeFAD.
We conduct extensive evaluations on four real datasets, where PeFAD outperforms existing state-of-the-art baselines by up to 28.74%.
arXiv Detail & Related papers (2024-06-04T13:51:08Z)
- Empowering HWNs with Efficient Data Labeling: A Clustered Federated
Semi-Supervised Learning Approach [2.046985601687158]
Clustered Federated Multitask Learning (CFL) has gained considerable attention as an effective strategy for overcoming statistical challenges.
We introduce a novel framework, Clustered Federated Semi-Supervised Learning (CFSL), designed for more realistic HWN scenarios.
Our results demonstrate that CFSL significantly improves upon key metrics such as testing accuracy, labeling accuracy, and labeling latency under varying proportions of labeled and unlabeled data.
arXiv Detail & Related papers (2024-01-19T11:47:49Z)
- An Evaluation of Machine Learning Approaches for Early Diagnosis of
Autism Spectrum Disorder [0.0]
Autism Spectrum Disorder (ASD) is a neurological condition characterized by difficulties with social interaction and communication, and by repetitive activities.
This study employs diverse machine learning methods to identify crucial ASD traits, aiming to enhance and automate the diagnostic process.
arXiv Detail & Related papers (2023-09-20T21:23:37Z)
- A Dependable Hybrid Machine Learning Model for Network Intrusion
Detection [1.222622290392729]
We propose a new hybrid model that combines machine learning and deep learning to increase detection rates while securing dependability.
Our method produces excellent results when tested on two datasets, KDDCUP'99 and CIC-MalMem-2022.
arXiv Detail & Related papers (2022-12-08T20:19:27Z)
- Explaining Cross-Domain Recognition with Interpretable Deep Classifier [100.63114424262234]
Interpretable Deep Classifier (IDC) learns the nearest source samples of a target sample as evidence upon which the classifier makes the decision.
Our IDC leads to a more explainable model with almost no accuracy degradation and effectively calibrates classification for optimum reject options.
arXiv Detail & Related papers (2022-11-15T15:58:56Z)
- Using Sampling to Estimate and Improve Performance of Automated Scoring
Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms, intelligently sampling responses to be scored by humans.
We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
arXiv Detail & Related papers (2021-11-17T05:00:51Z)
- Anomaly Detection Based on Selection and Weighting in Latent Space [73.01328671569759]
We propose a novel selection-and-weighting-based anomaly detection framework called SWAD.
Experiments on both benchmark and real-world datasets have shown the effectiveness and superiority of SWAD.
arXiv Detail & Related papers (2021-03-08T10:56:38Z)
- TELESTO: A Graph Neural Network Model for Anomaly Classification in
Cloud Services [77.454688257702]
Machine learning (ML) and artificial intelligence (AI) are applied to IT system operation and maintenance.
One direction aims at the recognition of re-occurring anomaly types to enable remediation automation.
We propose a method that is invariant to dimensionality changes of given data.
arXiv Detail & Related papers (2021-02-25T14:24:49Z)
- Collaborative residual learners for automatic icd10 prediction using
prescribed medications [45.82374977939355]
We propose a novel collaborative residual learning based model to automatically predict ICD10 codes employing only prescriptions data.
We obtain multi-label classification results of 0.71 and 0.57 average precision, 0.57 and 0.38 F1-score, and 0.73 and 0.44 accuracy in predicting the principal diagnosis for the inpatient and outpatient datasets, respectively.
arXiv Detail & Related papers (2020-12-16T07:07:27Z)