De-identification of Unstructured Clinical Texts from Sequence to
Sequence Perspective
- URL: http://arxiv.org/abs/2108.07971v1
- Date: Wed, 18 Aug 2021 04:48:58 GMT
- Authors: Md Monowar Anjum, Noman Mohammed, Xiaoqian Jiang
- Abstract summary: We formulate the de-identification problem as a sequence to sequence learning problem instead of a token classification problem.
Early experimentation with our proposed approach achieved a 98.91% recall rate on the i2b2 dataset.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we propose a novel problem formulation for de-identification of
unstructured clinical text. We formulate the de-identification problem as a
sequence to sequence learning problem instead of a token classification
problem. Our approach is inspired by the recent state-of-the-art performance
of sequence to sequence learning models for named entity recognition. Early
experimentation with our proposed approach achieved a 98.91% recall rate on the
i2b2 dataset. This performance is comparable to current state-of-the-art models for
unstructured clinical text de-identification.
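The contrast between the two problem formulations can be sketched in a toy example. This is an illustrative sketch, not the authors' model: a real system would use a neural sequence to sequence architecture, and the tagging scheme and surrogate tokens here are invented for demonstration.

```python
# Toy contrast between the two de-identification formulations:
# token classification (one BIO label per token) versus sequence to
# sequence (generate a rewritten output sequence).

def token_classification_view(tokens, phi_spans):
    """Token classification: emit one BIO label per input token."""
    labels = ["O"] * len(tokens)
    for start, end, tag in phi_spans:          # e.g. (0, 2, "NAME")
        labels[start] = f"B-{tag}"
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"
    return labels

def seq2seq_view(tokens, phi_spans):
    """Sequence to sequence: generate an output sequence in which each
    PHI span is rewritten as a single surrogate tag token."""
    out, skip_until = [], 0
    span_by_start = {s: (e, t) for s, e, t in phi_spans}
    for i, tok in enumerate(tokens):
        if i < skip_until:
            continue
        if i in span_by_start:
            end, tag = span_by_start[i]
            out.append(f"[{tag}]")             # whole span -> one tag token
            skip_until = end
        else:
            out.append(tok)
    return out

tokens = ["John", "Smith", "was", "admitted", "on", "May", "5"]
spans = [(0, 2, "NAME"), (5, 7, "DATE")]
print(token_classification_view(tokens, spans))
# ['B-NAME', 'I-NAME', 'O', 'O', 'O', 'B-DATE', 'I-DATE']
print(seq2seq_view(tokens, spans))
# ['[NAME]', 'was', 'admitted', 'on', '[DATE]']
```

In the first view the model predicts a label for every input token; in the second it generates a new sequence, which is what allows standard seq2seq architectures to be applied directly.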
Related papers
- Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives [1.4652274443334974]
We introduce Anonpsy, a de-identification framework that reformulates the task as graph-guided semantic rewriting.
Anonpsy converts each narrative into a semantic graph encoding clinical entities, temporal anchors, and typed relations.
It preserves diagnostic fidelity while achieving consistently low re-identification risk under expert, semantic, and GPT-5-based evaluations.
arXiv Detail & Related papers (2026-01-20T01:37:44Z) - Noise & pattern: identity-anchored Tikhonov regularization for robust structural anomaly detection [58.535473924035365]
Anomaly detection plays a pivotal role in automated industrial inspection, aiming to identify subtle or rare defects in otherwise uniform visual patterns.
We tackle structural anomaly detection using a self-supervised autoencoder that learns to repair corrupted inputs.
We introduce a corruption model that injects artificial disruptions into training images to mimic structural defects.
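A corruption model of this kind can be sketched minimally as overwriting a random patch of a clean image with noise; the patch shape, noise values, and list-of-rows image representation below are invented for illustration and are not the paper's exact corruption process.

```python
# Illustrative corruption model: inject an artificial rectangular
# disruption into a clean "image" so a self-supervised autoencoder can
# be trained to repair corrupted inputs back to the original.
import random

def corrupt(image, patch_h, patch_w, rng):
    """Return a copy of `image` (a list of rows) with one random
    rectangular patch overwritten by noise values in [0, 1)."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - patch_h + 1)
    left = rng.randrange(w - patch_w + 1)
    corrupted = [row[:] for row in image]
    for r in range(top, top + patch_h):
        for c in range(left, left + patch_w):
            corrupted[r][c] = rng.random()     # simulated structural defect
    return corrupted

rng = random.Random(0)                          # fixed seed for reproducibility
clean = [[0.5] * 8 for _ in range(8)]           # uniform 8x8 pattern
noisy = corrupt(clean, 3, 3, rng)
changed = sum(noisy[r][c] != clean[r][c] for r in range(8) for c in range(8))
print(changed)  # 9 pixels (the 3x3 patch) were altered
```

Training pairs are then (corrupted, clean), so the autoencoder's reconstruction target is the uncorrupted input.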
arXiv Detail & Related papers (2025-11-10T15:48:50Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space.
Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives [84.03001845263]
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management.
Traditional narrative analysis often focuses on local indicators in microstructure, such as word usage and syntax.
We propose to investigate specific cognitive and linguistic challenges by analyzing topical shifts, temporal dynamics, and the coherence of narratives over time.
arXiv Detail & Related papers (2025-01-07T12:16:26Z) - Long-Sequence Recommendation Models Need Decoupled Embeddings [49.410906935283585]
We identify and characterize a neglected deficiency in existing long-sequence recommendation models.
A single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes.
We propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are learned separately to fully decouple attention and representation.
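The decoupling idea can be sketched with two separate lookup tables: one consulted only when scoring attention, the other only when pooling the representation. The table sizes and the dot-product attention below are illustrative choices, not the DARE paper's exact architecture.

```python
# Minimal sketch of decoupled attention and representation embeddings:
# attention scores come exclusively from one table, while the pooled
# output vector is built exclusively from a second table.
import numpy as np

rng = np.random.default_rng(0)
n_items, d_attn, d_repr = 100, 8, 16
attn_table = rng.normal(size=(n_items, d_attn))   # attention embeddings
repr_table = rng.normal(size=(n_items, d_repr))   # representation embeddings

def attend(history_ids, target_id):
    """Weight a behavior history against a target item, then pool."""
    # Scores use only the attention table...
    scores = attn_table[history_ids] @ attn_table[target_id]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over the history
    # ...while the pooled user vector uses only the representation table.
    return weights @ repr_table[history_ids]

user_vec = attend(np.array([3, 17, 42, 99]), 7)
print(user_vec.shape)  # (16,)
```

Because the two tables receive separate gradients, learning good attention scores no longer interferes with learning good representations, which is the deficiency the abstract describes.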
arXiv Detail & Related papers (2024-10-03T15:45:15Z) - DeIDClinic: A Multi-Layered Framework for De-identification of Clinical Free-text Data [6.473402241020136]
This work enhances the MASK framework by integrating ClinicalBERT, a deep learning model specifically fine-tuned on clinical texts.
The system effectively identifies and either redacts or replaces sensitive identifiable entities within clinical documents.
A risk assessment feature has also been developed, which analyses the uniqueness of context within documents to classify them into risk levels.
arXiv Detail & Related papers (2024-10-02T15:16:02Z) - On the Importance of Step-wise Embeddings for Heterogeneous Clinical
Time-Series [1.3285222309805063]
Recent advances in deep learning for sequence modeling have not fully transferred to tasks handling time-series from electronic health records.
In particular, in problems related to the Intensive Care Unit (ICU), the state-of-the-art remains to tackle sequence classification in a tabular manner with tree-based methods.
arXiv Detail & Related papers (2023-11-15T12:18:15Z) - Pyclipse, a library for deidentification of free-text clinical notes [0.40329768057075643]
We propose the pyclipse framework to streamline the comparison of deidentification algorithms.
Pyclipse serves as a single interface for running open-source deidentification algorithms on local clinical data.
We find that algorithm performance consistently falls short of the results reported in the original papers, even when evaluated on the same benchmark dataset.
arXiv Detail & Related papers (2023-11-05T19:56:58Z) - Towards Semi-Structured Automatic ICD Coding via Tree-based Contrastive
Learning [18.380293890624102]
We investigate the semi-structured nature of clinical notes and propose an automatic algorithm to segment them into sections.
To address the variability issues in existing ICD coding models with limited data, we introduce a contrastive pre-training approach on sections.
arXiv Detail & Related papers (2023-10-14T22:07:13Z) - Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of the systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - SimMC: Simple Masked Contrastive Learning of Skeleton Representations
for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z) - LifeLonger: A Benchmark for Continual Disease Classification [59.13735398630546]
We introduce LifeLonger, a benchmark for continual disease classification on the MedMNIST collection.
Task and class incremental learning of diseases address the issue of classifying new samples without re-training the models from scratch.
Cross-domain incremental learning addresses the issue of dealing with datasets originating from different institutions while retaining the previously obtained knowledge.
arXiv Detail & Related papers (2022-04-12T12:25:05Z) - Detecting of a Patient's Condition From Clinical Narratives Using
Natural Language Representation [0.3149883354098941]
This paper proposes a joint clinical natural language representation learning and supervised classification framework.
The novel framework jointly discovers distributional syntactic and latent semantic (representation learning) from contextual clinical narrative inputs.
The proposed framework yields an overall classification performance of 89% accuracy, 88% recall, and 89% precision.
arXiv Detail & Related papers (2021-04-08T17:16:04Z) - An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research.
Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains.
In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.