Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts
- URL: http://arxiv.org/abs/2204.04775v1
- Date: Sun, 10 Apr 2022 21:46:52 GMT
- Title: Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of
Code-Mixed Clinical Texts
- Authors: Saadullah Amin, Noon Pokaratsiri Goldstein, Morgan Kelly Wixted,
Alejandro García-Rudolph, Catalina Martínez-Costa, Günter Neumann
- Abstract summary: Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to the low-resource, real-world challenge of de-identifying code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
- Score: 56.72488923420374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the advances in digital healthcare systems offering curated
structured knowledge, much of the critical information still lies in large
volumes of unlabeled and unstructured clinical texts. These texts, which often
contain protected health information (PHI), are exposed to information
extraction tools for downstream applications, risking patient identification.
Existing works in de-identification rely on using large-scale annotated corpora
in English, which often are not suitable in real-world multilingual settings.
Pre-trained language models (LMs) have shown great potential for cross-lingual
transfer in low-resource settings. In this work, we empirically show the
few-shot cross-lingual transfer property of LMs for named entity recognition
(NER) and apply it to solve a low-resource and real-world challenge of
code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke
domain. We annotate a gold evaluation dataset to assess few-shot setting
performance where we only use a few hundred labeled examples for training. Our
model improves the zero-shot F1-score from 73.7% to 91.2% on the gold
evaluation set when adapting Multilingual BERT (mBERT) (Devlin et al., 2019)
from the MEDDOCAN (Marimon et al., 2019) corpus with our few-shot cross-lingual
target corpus. When generalized to an out-of-sample test set, the best model
achieves a human-evaluation F1-score of 97.2%.
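The recipe the abstract describes (fine-tune mBERT for token-level PHI tagging, then continue training on a few hundred labeled target-language examples) can be sketched roughly as below; the coarse BIO label set, hyperparameters, and data handling are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of few-shot adaptation of multilingual BERT for
# coarse-grained de-identification (token classification).
# Label set and hyperparameters are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE",
          "B-LOCATION", "I-LOCATION", "B-ID", "I-ID"]  # assumed coarse PHI tags
label2id = {l: i for i, l in enumerate(LABELS)}
id2label = {i: l for l, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(LABELS), id2label=id2label, label2id=label2id)

def encode(words, tags):
    """Tokenize a pre-split sentence and align word-level tags to subword pieces."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True,
                    return_tensors="pt")
    word_ids = enc.word_ids(0)
    # Sub-word pieces inherit their word's label; special tokens are ignored (-100).
    enc["labels"] = torch.tensor(
        [[-100 if w is None else label2id[tags[w]] for w in word_ids]])
    return enc

def finetune(examples, epochs=3, lr=3e-5):
    """Few-shot adaptation on a list of (words, tags) pairs."""
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for words, tags in examples:
            loss = model(**encode(words, tags)).loss
            loss.backward()
            optim.step()
            optim.zero_grad()

# A few hundred such labeled target-language examples would be used in the
# few-shot setting; one toy example is shown.
finetune([(["Paciente", "Joan", "ingresado", "el", "03/05/2019"],
           ["O", "B-NAME", "O", "O", "B-DATE"])])
```

In this reading, source-domain fine-tuning on MEDDOCAN would precede the few-shot step; the same loop is simply run again on the small target corpus.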
Related papers
- DAEDRA: A language model for predicting outcomes in passive
pharmacovigilance reporting [0.0]
DAEDRA is a large language model designed to detect regulatory-relevant outcomes in adverse event reports.
This paper details the conception, design, training and evaluation of DAEDRA.
arXiv Detail & Related papers (2024-02-10T16:48:45Z)
- FRASIMED: a Clinical French Annotated Resource Produced through
Crosslingual BERT-Based Annotation Projection [0.6116681488656472]
This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection.
We present the creation of French Annotated Resource with Semantic Information for Medical Detection (FRASIMED), an annotated corpus comprising 2,051 synthetic clinical cases in French.
arXiv Detail & Related papers (2023-09-19T17:17:28Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual
Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
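As a toy illustration of the projection step only (the bracket marker format and the helper below are assumptions, not the CROP implementation), recovering token-level BIO tags from a marker-annotated translation could look like:

```python
# Toy illustration: a labeled sequence translation model is assumed to have
# kept entity spans inside "[TYPE ... ]" markers in the target language;
# here we turn that marked string into raw tokens plus BIO tags.
import re

def project_labels(marked_translation: str):
    tokens, tags = [], []
    # Split on bracketed spans of the form "[TYPE ... ]".
    for part in re.split(r"(\[[A-Z]+ .*?\])", marked_translation):
        span = re.match(r"\[([A-Z]+) (.*?)\]", part)
        if span:
            ent_type, words = span.group(1), span.group(2).split()
            tags += [f"B-{ent_type}"] + [f"I-{ent_type}"] * (len(words) - 1)
            tokens += words
        else:
            words = part.split()
            tokens += words
            tags += ["O"] * len(words)
    return tokens, tags

print(project_labels("[PER Juan Pérez ] ingresó en [LOC Barcelona ]"))
# (['Juan', 'Pérez', 'ingresó', 'en', 'Barcelona'],
#  ['B-PER', 'I-PER', 'O', 'O', 'B-LOC'])
```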
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Cross-lingual Approaches for the Detection of Adverse Drug Reactions in
German from a Patient's Perspective [3.8233498951276403]
We present the first corpus for German Adverse Drug Reaction detection in patient-generated content.
The data consists of 4,169 binary annotated documents from a German patient forum.
arXiv Detail & Related papers (2022-08-03T12:52:01Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
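A minimal sketch of the unsupervised ad-hoc setting under stated assumptions (multilingual BERT with mean pooling is one possible choice; the paper studies several state-of-the-art encoders): embed queries and documents with the encoder and rank by cosine similarity.

```python
# Sketch of unsupervised ad-hoc CLIR: mean-pool token embeddings from a
# multilingual encoder and rank documents by cosine similarity.
# Model choice and pooling are assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state           # [B, T, H]
    mask = batch["attention_mask"].unsqueeze(-1)           # [B, T, 1]
    pooled = (hidden * mask).sum(1) / mask.sum(1)          # mean over real tokens
    return torch.nn.functional.normalize(pooled, dim=-1)

query = embed(["stroke rehabilitation outcomes"])          # English query
docs = embed(["Rehabilitació després d'un ictus",          # target-language docs
              "Resultados de la rehabilitación tras un ictus"])
scores = (query @ docs.T).squeeze(0)                       # cosine similarities
print(scores.argsort(descending=True))                     # ranked document indices
```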
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
- Estimating Redundancy in Clinical Text [6.245180523143739]
Clinicians populate new documents by duplicating existing notes, then updating accordingly.
Quantifying information redundancy can play an essential role in evaluating innovations that operate on clinical narratives.
We present and evaluate two strategies to measure redundancy: an information-theoretic approach and a lexicosyntactic and semantic model.
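As a rough illustration only (this compression-based proxy is an assumption in the same spirit, not the paper's information-theoretic or lexicosyntactic/semantic measures), one simple way to quantify how much of a new note duplicates earlier notes:

```python
# Back-of-the-envelope redundancy proxy: how much less it costs to compress
# a new note when the patient's earlier notes are supplied as context.
# Higher values suggest more duplicated text. Illustration only.
import zlib

def redundancy_ratio(note: str, prior_notes: list[str]) -> float:
    alone = len(zlib.compress(note.encode("utf-8")))
    context = "\n".join(prior_notes).encode("utf-8")
    # Marginal bytes needed for the note given the compressed context.
    marginal = len(zlib.compress(context + note.encode("utf-8"))) - len(zlib.compress(context))
    return max(0.0, (alone - marginal) / alone)

print(redundancy_ratio(
    "Patient stable, continue aspirin.",
    ["Patient stable, continue aspirin. Follow-up in 2 weeks."]))
```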
arXiv Detail & Related papers (2021-05-25T11:01:45Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of
Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery [1.9215779751499527]
The model is an extension of BaySMM [Kesiraju et al., 2020] to the multilingual scenario.
We propagate the learned uncertainties through linear classifiers that benefit zero-shot cross-lingual topic identification.
We revisit cross-lingual topic identification in zero-shot settings by taking a deeper dive into current datasets.
arXiv Detail & Related papers (2020-07-02T19:55:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences of its use.