Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT
- URL: http://arxiv.org/abs/2003.03106v2
- Date: Tue, 17 Mar 2020 13:16:41 GMT
- Title: Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT
- Authors: Aitor García-Pablos, Naiara Perez, Montse Cuadros
- Abstract summary: In this paper, we use a BERT-based sequence labelling model to conduct anonymisation experiments in Spanish.
Experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain-specific feature engineering.
- Score: 0.8379286663107844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Massive digital data processing provides a wide range of opportunities and benefits, but at the cost of endangering personal data privacy. Anonymisation consists of removing or replacing sensitive information in data, enabling its exploitation for different purposes while preserving the privacy of individuals. Over the years, many automatic anonymisation systems have been proposed; however, depending on the type of data, the target language or the availability of training documents, the task still remains challenging. The emergence of novel deep-learning models during the last two years has brought large improvements to the state of the art in the field of Natural Language Processing. These advancements have been most noticeably led by BERT, a model proposed by Google in 2018, and the shared language models pre-trained on millions of documents. In this paper, we use a BERT-based sequence labelling model to conduct a series of anonymisation experiments on several clinical datasets in Spanish. We also compare BERT to other algorithms. The experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain-specific feature engineering.
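The abstract describes a BERT-based sequence labelling (token classification) approach to detecting sensitive spans. Below is a minimal sketch of such a setup using the Hugging Face transformers library; the checkpoint name, BIO label set, and example sentence are illustrative assumptions, not the authors' exact configuration, and the model would need fine-tuning on annotated clinical text before its predictions are meaningful.

```python
# Minimal sketch of a BERT-based token-classification (sequence labelling) model
# for sensitive-data detection. Checkpoint, label set, and sentence are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label set for a few sensitive-entity types
labels = ["O", "B-NAME", "I-NAME", "B-DATE", "I-DATE", "B-ID", "I-ID"]

model_name = "bert-base-multilingual-cased"  # assumed general-domain checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Tag one example sentence; the classification head is randomly initialised here,
# so in practice the model is first fine-tuned on annotated clinical text.
text = "La paciente María López ingresó el 12/03/2019."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, pred in zip(tokens, predictions):
    print(token, labels[pred])
```

An anonymisation step would then mask or replace every span tagged with a non-O label before the text is shared.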
Related papers
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Memorization of Named Entities in Fine-tuned BERT Models [3.0177210416625115]
We investigate the extent of named entity memorization in fine-tuned BERT models.
We show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that has only been pre-trained.
arXiv Detail & Related papers (2022-12-07T16:20:50Z)
- Differentially Private Language Models for Secure Data Sharing [19.918137395199224]
In this paper, we show how to train a generative language model in a differentially private manner and consequently sample data from it.
Using natural language prompts and a new prompt-mismatch loss, we are able to create highly accurate and fluent textual datasets.
We perform thorough experiments indicating that our synthetic datasets do not leak information from our original data and are of high language quality.
arXiv Detail & Related papers (2022-10-25T11:12:56Z)
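As a rough illustration of the "train privately, then sample" idea in the entry above, the following toy DP-SGD step clips each example's gradient and adds Gaussian noise before the update. The model, data, and hyper-parameters are made-up assumptions, and the paper's prompt-mismatch loss is not reproduced here.

```python
# Toy DP-SGD step: clip per-example gradients, sum, add Gaussian noise, update.
# Everything below (model, data, hyper-parameters) is an illustrative assumption.
import torch
import torch.nn as nn

vocab_size, clip_norm, noise_mult, lr = 100, 1.0, 1.0, 0.1
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Flatten(), nn.Linear(32, vocab_size))
loss_fn = nn.CrossEntropyLoss()

batch_x = torch.randint(0, vocab_size, (8, 1))  # fake context tokens
batch_y = torch.randint(0, vocab_size, (8,))    # fake next tokens

# Accumulate per-example gradients, each clipped to clip_norm
grads = [torch.zeros_like(p) for p in model.parameters()]
for x, y in zip(batch_x, batch_y):
    model.zero_grad()
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    loss.backward()
    total_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
    scale = min(1.0, clip_norm / (total_norm.item() + 1e-6))
    for g, p in zip(grads, model.parameters()):
        g += p.grad * scale

# Add noise calibrated to the clipping bound, then take an averaged SGD step
with torch.no_grad():
    for g, p in zip(grads, model.parameters()):
        g += torch.randn_like(g) * noise_mult * clip_norm
        p -= lr * g / len(batch_y)
```

After enough private updates, synthetic text can be sampled from the trained model and shared in place of the original data, which is the use case the entry above targets.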
- Transferring BERT-like Transformers' Knowledge for Authorship Verification [8.443350618722562]
We study the effectiveness of several BERT-like transformers for the task of authorship verification.
We provide new splits for PAN-2020, where training and test data are sampled from disjoint topics or authors.
We show that those splits can enhance the models' capability to transfer knowledge to a new, significantly different dataset.
arXiv Detail & Related papers (2021-12-09T18:57:29Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep the resulting dataset manageable, we apply a dataset distillation strategy that compresses it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- What BERT Sees: Cross-Modal Transfer for Visual Question Generation [21.640299110619384]
We study the visual capabilities of BERT out-of-the-box, without any additional pre-training on supplementary data.
We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage either mono- or multi-modal representations.
arXiv Detail & Related papers (2020-02-25T12:44:36Z)
- Federated pretraining and fine tuning of BERT using clinical notes from multiple silos [4.794677806040309]
We show that it is possible to pre-train and fine-tune BERT models in a federated manner using clinical texts from different silos without moving the data.
arXiv Detail & Related papers (2020-02-20T04:14:35Z)
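The last entry above describes federated pretraining and fine-tuning of BERT across clinical silos. A minimal sketch of the underlying idea, assuming a plain federated-averaging (FedAvg) scheme over model weights, is shown below; the checkpoint, number of silos, and local training loop are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal federated-averaging sketch: silos train locally, only weights are shared.
# Checkpoint, silo count, and rounds are illustrative assumptions.
import copy
import torch
from transformers import AutoModelForMaskedLM

def federated_average(state_dicts):
    """Average model parameters collected from each silo into one global state."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        avg[key] = stacked.mean(dim=0).to(avg[key].dtype)
    return avg

# Assumed general-domain starting point
global_model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

for communication_round in range(3):
    silo_states = []
    for silo_id in range(2):  # each silo trains on its own clinical notes locally
        local_model = copy.deepcopy(global_model)
        # ... local masked-language-model training on the silo's private data ...
        silo_states.append(local_model.state_dict())
    # Only model parameters leave the silos; the raw notes never move.
    global_model.load_state_dict(federated_average(silo_states))
```

The same averaging loop can be reused for fine-tuning a task head, which is the second stage the entry mentions.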
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.