PHICON: Improving Generalization of Clinical Text De-identification
Models via Data Augmentation
- URL: http://arxiv.org/abs/2010.05143v1
- Date: Sun, 11 Oct 2020 02:57:11 GMT
- Title: PHICON: Improving Generalization of Clinical Text De-identification
Models via Data Augmentation
- Authors: Xiang Yue and Shuang Zhou
- Abstract summary: We propose a simple yet effective data augmentation method PHICON to alleviate the generalization issue.
PHICON consists of PHI augmentation and Context augmentation, which creates augmented training corpora.
Experimental results on the i2b2 2006 and 2014 de-identification challenge datasets show that PHICON can help three selected de-identification models boost F1-score (by at most 8.6%) in a cross-dataset test setting.
- Score: 5.462226912969162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: De-identification is the task of identifying protected health information
(PHI) in the clinical text. Existing neural de-identification models often fail
to generalize to a new dataset. We propose a simple yet effective data
augmentation method PHICON to alleviate the generalization issue. PHICON
consists of PHI augmentation and Context augmentation, which creates augmented
training corpora by replacing PHI entities with named-entities sampled from
external sources, and by changing background context with synonym replacement
or random word insertion, respectively. Experimental results on the i2b2 2006
and 2014 de-identification challenge datasets show that PHICON can help three
selected de-identification models boost F1-score (by at most 8.6%) in a
cross-dataset test setting. We also discuss how much augmentation to use and
how each augmentation method influences the performance.
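The two augmentation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes BIO-tagged token sequences, and the entity lists (`EXTERNAL_ENTITIES`), the synonym table (`SYNONYMS`), and the function names are hypothetical stand-ins for the external sources the paper samples from. For simplicity, the insertion variant here places a synonym directly after its source word, which is only one possible reading of "random word insertion".

```python
import random

# Hypothetical external entity lists (the paper samples from external sources).
EXTERNAL_ENTITIES = {
    "PATIENT": [["Maria", "Lopez"], ["Chen", "Wei"]],
    "HOSPITAL": [["Riverside", "General", "Hospital"]],
}

# Hypothetical toy synonym table for context words.
SYNONYMS = {"admitted": ["hospitalized"], "seen": ["examined"]}


def phi_augment(tokens, tags, rng):
    """Replace each BIO-tagged PHI span with a sampled entity of the same type."""
    out_tokens, out_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            label = tag[2:]
            # Find the end of the current PHI span.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + label:
                j += 1
            candidates = EXTERNAL_ENTITIES.get(label)
            # Keep the original span if no replacement list exists for the type.
            new_span = rng.choice(candidates) if candidates else tokens[i:j]
            out_tokens.extend(new_span)
            out_tags.extend(["B-" + label] + ["I-" + label] * (len(new_span) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_tags.append(tag)
            i += 1
    return out_tokens, out_tags


def context_augment(tokens, tags, rng, p=0.3):
    """Perturb non-PHI ("O") words by synonym replacement or synonym insertion."""
    out_tokens, out_tags = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "O" and tok.lower() in SYNONYMS and rng.random() < p:
            syn = rng.choice(SYNONYMS[tok.lower()])
            if rng.random() < 0.5:
                # Synonym replacement: swap the context word.
                out_tokens.append(syn)
                out_tags.append("O")
            else:
                # Insertion: keep the word and add a synonym after it.
                out_tokens.extend([tok, syn])
                out_tags.extend(["O", "O"])
        else:
            out_tokens.append(tok)
            out_tags.append(tag)
    return out_tokens, out_tags


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["Mr.", "John", "Smith", "was", "admitted", "to", "Mercy", "Hospital", "."]
    tags = ["O", "B-PATIENT", "I-PATIENT", "O", "O", "O", "B-HOSPITAL", "I-HOSPITAL", "O"]
    print(phi_augment(tokens, tags, rng))
    print(context_augment(tokens, tags, rng, p=1.0))
```

In this sketch the PHI step changes only tagged spans and the context step changes only "O" tokens, so the labels stay aligned with the tokens and the augmented sentences can be appended to the training corpus directly.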
Related papers
- DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization [13.038800602897354]
We develop an adversarial approach that uses a large language model to re-identify the patient corresponding to a redacted clinical note.
Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes.
arXiv Detail & Related papers (2024-10-22T14:06:31Z)
- Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks [7.928574214440075]
This study examines integrating EHRs and NLP with large language models (LLMs) to improve healthcare data management and patient care.
It focuses on using advanced models to create secure, HIPAA-compliant synthetic patient notes for biomedical research.
arXiv Detail & Related papers (2024-07-23T04:20:14Z)
- ASPS: Augmented Segment Anything Model for Polyp Segmentation [77.25557224490075]
The Segment Anything Model (SAM) has introduced unprecedented potential for polyp segmentation.
SAM's Transformer-based structure prioritizes global and low-frequency information.
CFA integrates a trainable CNN encoder branch with a frozen ViT encoder, enabling the integration of domain-specific knowledge.
arXiv Detail & Related papers (2024-06-30T14:55:32Z)
- DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z)
- Part-aware Personalized Segment Anything Model for Patient-Specific Segmentation [5.797437925674252]
Precision medicine, such as patient-adaptive treatments utilizing medical images, poses new challenges for image segmentation algorithms.
We propose a data-efficient segmentation method to address these challenges, namely the Part-aware Personalized Segment Anything Model (P2SAM).
We introduce a novel part-aware prompt mechanism to select multiple-point prompts based on part-level features of the one-shot data.
arXiv Detail & Related papers (2024-03-08T16:34:30Z)
- Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data.
Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance.
There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z)
- EMIT-Diff: Enhancing Medical Image Segmentation via Text-Guided Diffusion Model [4.057796755073023]
We develop controllable diffusion models for medical image synthesis, called EMIT-Diff.
We leverage recent diffusion probabilistic models to generate realistic and diverse synthetic medical image data.
In our approach, we ensure that the synthesized samples adhere to medically relevant constraints.
arXiv Detail & Related papers (2023-10-19T16:18:02Z)
- Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records.
We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data.
We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z)
- Dataset Distillation via Factorization [58.8114016318593]
We introduce a dataset factorization approach, termed HaBa, which is a plug-and-play strategy portable to any existing dataset distillation (DD) baseline.
HaBa explores decomposing a dataset into two components: data Hallucination networks and Bases.
Our method can yield significant improvement on downstream classification tasks compared with the previous state of the art, while reducing the total number of compressed parameters by up to 65%.
arXiv Detail & Related papers (2022-10-30T08:36:19Z)
- An Analysis of Simple Data Augmentation for Named Entity Recognition [21.013836715832564]
We design and compare data augmentation for named entity recognition.
We show that simple augmentation can boost performance for both recurrent and transformer-based models.
arXiv Detail & Related papers (2020-10-22T13:21:03Z)
- CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding [67.61357003974153]
We propose a novel data augmentation framework dubbed CoDA.
CoDA synthesizes diverse and informative augmented examples by integrating multiple transformations organically.
A contrastive regularization objective is introduced to capture the global relationship among all the data samples.
arXiv Detail & Related papers (2020-10-16T23:57:03Z)