Memorization of Named Entities in Fine-tuned BERT Models
- URL: http://arxiv.org/abs/2212.03749v2
- Date: Tue, 10 Oct 2023 14:32:43 GMT
- Title: Memorization of Named Entities in Fine-tuned BERT Models
- Authors: Andor Diera and Nicolas Lell and Aygul Garifullina and Ansgar Scherp
- Abstract summary: We investigate the extent of named entity memorization in fine-tuned BERT models.
We show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that is pre-trained only.
- Score: 3.0177210416625115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Privacy preserving deep learning is an emerging field in machine learning
that aims to mitigate the privacy risks in the use of deep neural networks. One
such risk is training data extraction from language models that have been
trained on datasets that contain personal and privacy-sensitive information.
In our study, we investigate the extent of named entity memorization in
fine-tuned BERT models. We use single-label text classification as a
representative downstream task and employ three different fine-tuning setups in
our experiments, including one with Differential Privacy (DP). We create a
large number of text samples from the fine-tuned BERT models utilizing a custom
sequential sampling strategy with two prompting strategies. We search in these
samples for named entities and check if they are also present in the
fine-tuning datasets. We experiment with two benchmark datasets in the domains
of emails and blogs. We show that the application of DP has a detrimental
effect on the text generation capabilities of BERT. Furthermore, we show that a
fine-tuned BERT does not generate more named entities specific to the
fine-tuning dataset than a BERT model that is pre-trained only. This suggests
that BERT is unlikely to emit personal or privacy-sensitive named entities.
Overall, our results are important to understand to what extent BERT-based
services are prone to training data extraction attacks.
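As a rough illustration of the extraction pipeline described in the abstract, the sketch below generates text from a BERT masked language model by repeatedly filling a trailing [MASK] token and then checks the generated sample for named entities that also occur in the fine-tuning data. This is a minimal, hypothetical sketch: the checkpoint name, prompt, sampling parameters, and the placeholder entity set are assumptions, and the paper's actual sequential sampling and prompting strategies may differ.

```python
# Minimal, hypothetical sketch of the extraction pipeline (not the paper's exact code).
# Assumptions: a Hugging Face BERT MLM checkpoint, spaCy's small English model, and a
# placeholder entity set standing in for entities found in the fine-tuning corpus.
import torch
import spacy
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # or a fine-tuned checkpoint
model.eval()

def sample_sequentially(prompt: str, num_new_tokens: int = 30, top_k: int = 40) -> str:
    """Generate text left to right by repeatedly predicting a trailing [MASK] token."""
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    for _ in range(num_new_tokens):
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id] + ids + [tokenizer.mask_token_id, tokenizer.sep_token_id]]
        )
        with torch.no_grad():
            logits = model(input_ids).logits
        mask_pos = 1 + len(ids)  # index of the trailing [MASK] (after [CLS] and the context)
        probs = torch.softmax(logits[0, mask_pos], dim=-1)
        top = torch.topk(probs, top_k)
        next_id = top.indices[torch.multinomial(top.values, 1)].item()  # top-k sampling
        ids.append(next_id)
    return tokenizer.decode(ids)

# Search the generated sample for named entities and check overlap with the
# (placeholder) entities extracted from the fine-tuning dataset.
nlp = spacy.load("en_core_web_sm")
finetune_entities = {"alice johnson", "acme corp"}  # illustrative placeholder

sample = sample_sequentially("I am writing to let you know that")
generated_entities = {ent.text.lower() for ent in nlp(sample).ents}
print(sample)
print("Entities also present in fine-tuning data:", generated_entities & finetune_entities)
```

In the paper's setting, the entity set would be built by running NER over the fine-tuning corpora (e.g., the email and blog datasets) rather than hard-coded, and the sampling would be repeated over many prompts to collect a large pool of generated text.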
Related papers
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering
Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Which Student is Best? A Comprehensive Knowledge Distillation Exam for
Task-Specific BERT Models [3.303435360096988]
We perform a knowledge distillation benchmark from task-specific BERT-base teacher models to various student models.
Our experiment involves 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language.
Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN student models provides the best trade-off between performance and computational resources.
arXiv Detail & Related papers (2022-01-03T10:07:13Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - De-identification of Privacy-related Entities in Job Postings [10.751883216434717]
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data.
We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stackoverflow.
arXiv Detail & Related papers (2021-05-24T12:01:22Z) - Fine-Tuning BERT for Sentiment Analysis of Vietnamese Reviews [0.0]
Experimental results on two datasets show that models using BERT slightly outperform other models using GloVe and FastText.
Our proposed BERT fine-tuning method produces a model with better performance than the original BERT fine-tuning method.
arXiv Detail & Related papers (2020-11-20T14:45:46Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
We further propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z) - Sensitive Data Detection and Classification in Spanish Clinical Text:
Experiments with BERT [0.8379286663107844]
In this paper, we use a BERT-based sequence labelling model to conduct anonymisation experiments in Spanish.
Experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain specific feature engineering.
arXiv Detail & Related papers (2020-03-06T09:46:51Z) - What BERT Sees: Cross-Modal Transfer for Visual Question Generation [21.640299110619384]
We study the visual capabilities of BERT out-of-the-box, by avoiding pre-training made on supplementary data.
We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage either mono- or multi-modal representations.
arXiv Detail & Related papers (2020-02-25T12:44:36Z) - Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation [84.64004917951547]
Fine-tuning pre-trained language models like BERT has become an effective approach in NLP.
In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation.
arXiv Detail & Related papers (2020-02-24T16:17:12Z) - Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
arXiv Detail & Related papers (2020-02-17T08:13:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.