Memorization of Named Entities in Fine-tuned BERT Models
- URL: http://arxiv.org/abs/2212.03749v2
- Date: Tue, 10 Oct 2023 14:32:43 GMT
- Title: Memorization of Named Entities in Fine-tuned BERT Models
- Authors: Andor Diera and Nicolas Lell and Aygul Garifullina and Ansgar Scherp
- Abstract summary: We investigate the extent of named entity memorization in fine-tuned BERT models.
We show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that is pre-trained only.
- Score: 3.0177210416625115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Privacy preserving deep learning is an emerging field in machine learning
that aims to mitigate the privacy risks in the use of deep neural networks. One
such risk is training data extraction from language models that have been
trained on datasets that contain personal and privacy-sensitive information.
In our study, we investigate the extent of named entity memorization in
fine-tuned BERT models. We use single-label text classification as a
representative downstream task and employ three different fine-tuning setups in
our experiments, including one with Differential Privacy (DP). We create a
large number of text samples from the fine-tuned BERT models utilizing a custom
sequential sampling strategy with two prompting strategies. We search in these
samples for named entities and check if they are also present in the
fine-tuning datasets. We experiment with two benchmark datasets in the domains
of emails and blogs. We show that the application of DP has a detrimental
effect on the text generation capabilities of BERT. Furthermore, we show that a
fine-tuned BERT does not generate more named entities specific to the
fine-tuning dataset than a BERT model that is pre-trained only. This suggests
that BERT is unlikely to emit personal or privacy-sensitive named entities.
Overall, our results are important to understand to what extent BERT-based
services are prone to training data extraction attacks.
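As a rough illustration of the extraction pipeline described in the abstract, the sketch below generates text from a BERT masked language model by repeatedly filling a trailing [MASK] token and then checks the generated sample for named entities that also occur in the fine-tuning data. This is a minimal, hypothetical sketch: the checkpoint name, prompt, sampling parameters, and the placeholder entity set are assumptions, and the paper's actual sequential sampling and prompting strategies may differ.

```python
# Minimal, hypothetical sketch of the extraction pipeline (not the paper's exact code).
# Assumptions: a Hugging Face BERT MLM checkpoint, spaCy's small English model, and a
# placeholder entity set standing in for entities found in the fine-tuning corpus.
import torch
import spacy
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")  # or a fine-tuned checkpoint
model.eval()

def sample_sequentially(prompt: str, num_new_tokens: int = 30, top_k: int = 40) -> str:
    """Generate text left to right by repeatedly predicting a trailing [MASK] token."""
    ids = tokenizer.encode(prompt, add_special_tokens=False)
    for _ in range(num_new_tokens):
        input_ids = torch.tensor(
            [[tokenizer.cls_token_id] + ids + [tokenizer.mask_token_id, tokenizer.sep_token_id]]
        )
        with torch.no_grad():
            logits = model(input_ids).logits
        mask_pos = 1 + len(ids)  # index of the trailing [MASK] (after [CLS] and the context)
        probs = torch.softmax(logits[0, mask_pos], dim=-1)
        top = torch.topk(probs, top_k)
        next_id = top.indices[torch.multinomial(top.values, 1)].item()  # top-k sampling
        ids.append(next_id)
    return tokenizer.decode(ids)

# Search the generated sample for named entities and check overlap with the
# (placeholder) entities extracted from the fine-tuning dataset.
nlp = spacy.load("en_core_web_sm")
finetune_entities = {"alice johnson", "acme corp"}  # illustrative placeholder

sample = sample_sequentially("I am writing to let you know that")
generated_entities = {ent.text.lower() for ent in nlp(sample).ents}
print(sample)
print("Entities also present in fine-tuning data:", generated_entities & finetune_entities)
```

In the paper's setting, the entity set would be built by running NER over the fine-tuning corpora (e.g., the email and blog datasets) rather than hard-coded, and the sampling would be repeated over many prompts to collect a large pool of generated text.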
Related papers
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering
Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - Which Student is Best? A Comprehensive Knowledge Distillation Exam for
Task-Specific BERT Models [3.303435360096988]
We perform a knowledge distillation benchmark from task-specific BERT-base teacher models to various student models.
Our experiment involves 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language.
Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN student models provides the best trade-off between performance and computational resources.
arXiv Detail & Related papers (2022-01-03T10:07:13Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person
Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - De-identification of Privacy-related Entities in Job Postings [10.751883216434717]
De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data.
We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stackoverflow.
arXiv Detail & Related papers (2021-05-24T12:01:22Z) - Fine-Tuning BERT for Sentiment Analysis of Vietnamese Reviews [0.0]
Experimental results on two datasets show that models using BERT slightly outperform other models using GloVe and FastText.
Our proposed BERT fine-tuning method produces a model with better performance than the original BERT fine-tuning method.
arXiv Detail & Related papers (2020-11-20T14:45:46Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
We further propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z) - Sensitive Data Detection and Classification in Spanish Clinical Text:
Experiments with BERT [0.8379286663107844]
In this paper, we use a BERT-based sequence labelling model to conduct anonymisation experiments in Spanish.
Experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain specific feature engineering.
arXiv Detail & Related papers (2020-03-06T09:46:51Z) - What BERT Sees: Cross-Modal Transfer for Visual Question Generation [21.640299110619384]
We study the visual capabilities of BERT out-of-the-box, by avoiding pre-training made on supplementary data.
We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage either mono- or multi-modal representations.
arXiv Detail & Related papers (2020-02-25T12:44:36Z) - Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation [84.64004917951547]
Fine-tuning pre-trained language models like BERT has become an effective approach in NLP.
In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation.
arXiv Detail & Related papers (2020-02-24T16:17:12Z) - Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
arXiv Detail & Related papers (2020-02-17T08:13:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.