Related papers: Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and Complexity

URL: http://arxiv.org/abs/2410.16410v1
Date: Mon, 21 Oct 2024 18:25:24 GMT
Title: Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and Complexity
Authors: Mengjiao Zhang, Jia Xu
Abstract summary: We propose Subword Embedding from Bytes (SEB) and encode subwords to byte sequences using deep neural networks. Our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. We verify that SEB obtains comparable or even better results than standard subword embedding methods in machine translation, sentiment analysis, and language modeling.
Score: 5.7601856226895665
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While NLP models significantly impact our lives, there are rising concerns about privacy invasion. Although federated learning enhances privacy, attackers may recover private training data by exploiting model parameters and gradients. Therefore, protecting against such embedding attacks remains an open challenge. To address this, we propose Subword Embedding from Bytes (SEB) and encode subwords to byte sequences using deep neural networks, making input text recovery harder. Importantly, our method requires less memory, with a vocabulary of only $256$ bytes, while maintaining efficiency by keeping the input length unchanged. Thus, our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy. Our experiments show that SEB can effectively prevent embedding-based attacks from recovering original sentences in federated learning. Meanwhile, we verify that SEB obtains comparable or even better results than standard subword embedding methods in machine translation, sentiment analysis, and language modeling, with even lower time and space complexity.
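The idea in the abstract, a 256-entry byte vocabulary that still yields one vector per subword, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how subwords are mapped to bytes or how the byte embeddings are combined, so the UTF-8 mapping, the mean pooling (a stand-in for the paper's neural aggregation), and the names `subword_to_bytes`, `embed_subword`, `embed_sentence`, `table_parameters`, and `DIM` are all assumptions made for illustration.

```python
import random

BYTE_VOCAB = 256  # one embedding row per possible byte value (from the abstract)
DIM = 8           # hypothetical embedding width, chosen only for illustration

rng = random.Random(0)
# Byte embedding table: 256 x DIM instead of |subword vocabulary| x DIM.
byte_table = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)]
              for _ in range(BYTE_VOCAB)]


def subword_to_bytes(subword: str) -> list[int]:
    """Assumed mapping: the UTF-8 bytes of the subword (the paper may differ)."""
    return list(subword.encode("utf-8"))


def embed_subword(subword: str) -> list[float]:
    """Mean-pool the byte embeddings so each subword yields a single vector.

    Mean pooling is a simple stand-in for the neural aggregation the
    abstract mentions; it is not taken from the paper.
    """
    ids = subword_to_bytes(subword)
    if not ids:
        raise ValueError("subword must be non-empty")
    rows = [byte_table[i] for i in ids]
    return [sum(col) / len(rows) for col in zip(*rows)]


def embed_sentence(subwords: list[str]) -> list[list[float]]:
    """One vector per subword, so the model's input length is unchanged."""
    return [embed_subword(s) for s in subwords]


def table_parameters(vocab_size: int, dim: int) -> int:
    """Number of parameters in an embedding table of the given shape."""
    return vocab_size * dim
```

Under these assumptions, `embed_sentence(["sub", "word", "s"])` returns three vectors of width `DIM`, and the embedding table holds `table_parameters(256, DIM)` parameters regardless of how large the subword vocabulary is.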

Related papers

ExpShield: Safeguarding Web Text from Unauthorized Crawling and Language Modeling Exploitation [17.71790411163849]
We propose ExpShield, a proactive self-defense mechanism that mitigates sample-specific memorization via imperceptible text perturbations. Our approach requires no external collaboration while maintaining original readability. Even with privacy backdoors, the Membership Inference Attack (MIA) AUC drops from 0.95 to 0.55, and instance exploitation approaches zero.
arXiv Detail & Related papers (2024-12-30T17:52:02Z)
Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data in a privacy-preserving manner. Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z)
NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human [55.20137833039499]
We suggest sanitizing sensitive text using two common strategies used by humans. We curate the first corpus, coined NAP^2, through both crowdsourcing and the use of large language models.
arXiv Detail & Related papers (2024-06-06T05:07:44Z)
InferDPT: Privacy-Preserving Inference for Black-box Large Language Model [66.07752875835506]
InferDPT is the first practical framework for the privacy-preserving Inference of black-box LLMs. RANTEXT is a novel differential privacy mechanism integrated into the perturbation module of InferDPT.
arXiv Detail & Related papers (2023-10-18T18:00:11Z)
PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind). Our work offers a theoretical analysis for model design and benchmarks various techniques. In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
Planting and Mitigating Memorized Content in Predictive-Text Language Models [11.911353678499008]
Language models are widely deployed to provide automatic text completion services in user products. Recent research has revealed that language models bear considerable risk of memorizing private training data. In this study, we test the efficacy of a range of privacy-preserving techniques to mitigate unintended memorization of sensitive user text.
arXiv Detail & Related papers (2022-12-16T17:57:14Z)
Privacy-Preserving Text Classification on BERT Embeddings with Homomorphic Encryption [23.010346603025255]
We propose a privatization mechanism for embeddings based on homomorphic encryption. We show that our method offers encrypted protection of BERT embeddings, while largely preserving their utility on downstream text classification tasks.
arXiv Detail & Related papers (2022-10-05T21:46:02Z)
Defending against Reconstruction Attacks with Rényi Differential Privacy [72.1188520352079]
Reconstruction attacks allow an adversary to regenerate data samples of the training set using access to only a trained model. Differential privacy is a known solution to such attacks, but is often used with a relatively large privacy budget. We show that, for the same mechanism, we can derive privacy guarantees for reconstruction attacks that are better than the traditional ones from the literature.
arXiv Detail & Related papers (2022-02-15T18:09:30Z)
Federated Deep Learning with Bayesian Privacy [28.99404058773532]
Federated learning (FL) aims to protect data privacy by cooperatively learning a model without sharing private data among users. Homomorphic encryption (HE) based methods provide secure privacy protections but suffer from extremely high computational and communication overheads. Deep learning with Differential Privacy (DP) was implemented as a practical learning algorithm at a manageable cost in complexity.
arXiv Detail & Related papers (2021-09-27T12:48:40Z)
CAPE: Context-Aware Private Embeddings for Private Language Learning [0.5156484100374058]
Context-Aware Private Embeddings (CAPE) is a novel approach which preserves privacy during training of embeddings. CAPE applies calibrated noise through differential privacy, preserving the encoded semantic links while obscuring sensitive information. Experimental results demonstrate that the proposed approach reduces private information leakage better than either single intervention.
arXiv Detail & Related papers (2021-08-27T14:50:12Z)
Privacy-Adaptive BERT for Natural Language Understanding [20.821155542969947]
We study how to improve the effectiveness of NLU models under a Local Privacy setting using BERT. We propose privacy-adaptive LM pretraining methods and demonstrate that they can significantly improve model performance on privatized text input.
arXiv Detail & Related papers (2021-04-15T15:01:28Z)
BERT-ATTACK: Adversarial Attack Against BERT Using BERT [77.82947768158132]
Adversarial attacks for discrete data (such as texts) are more challenging than for continuous data (such as images). We propose BERT-Attack, a high-quality and effective method to generate adversarial samples. Our method outperforms state-of-the-art attack strategies in both success rate and perturb percentage.
arXiv Detail & Related papers (2020-04-21T13:30:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.