Privacy-Preserving Text Classification on BERT Embeddings with
Homomorphic Encryption
- URL: http://arxiv.org/abs/2210.02574v1
- Date: Wed, 5 Oct 2022 21:46:02 GMT
- Title: Privacy-Preserving Text Classification on BERT Embeddings with
Homomorphic Encryption
- Authors: Garam Lee, Minsoo Kim, Jai Hyun Park, Seung-won Hwang, Jung Hee Cheon
- Abstract summary: We propose a privatization mechanism for embeddings based on homomorphic encryption.
We show that our method offers encrypted protection of BERT embeddings, while largely preserving their utility on downstream text classification tasks.
- Score: 23.010346603025255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embeddings, which compress information in raw text into semantics-preserving
low-dimensional vectors, have been widely adopted for their efficacy. However,
recent research has shown that embeddings can potentially leak private
information about sensitive attributes of the text, and in some cases, can be
inverted to recover the original input text. To address these growing privacy
challenges, we propose a privatization mechanism for embeddings based on
homomorphic encryption, to prevent potential leakage of any piece of
information in the process of text classification. In particular, our method
performs text classification directly on encrypted embeddings from
state-of-the-art models like BERT, supported by an efficient GPU implementation
of the CKKS encryption scheme. We show that our method offers encrypted
protection of BERT embeddings while largely preserving their utility on
downstream text classification tasks.
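The paper's pipeline runs classification on CKKS ciphertexts of BERT embeddings; CKKS supports approximate arithmetic over encrypted real-valued vectors and needs a dedicated library and GPU backend. As a much smaller illustration of the same principle — computing a linear classifier score without ever decrypting the inputs — the sketch below uses a toy Paillier-style additively homomorphic scheme over quantized integer embeddings with plaintext weights. Every name and parameter here is illustrative, none of it comes from the paper's implementation, and the tiny primes are wildly insecure.

```python
import math
import random

def keygen(p=1789, q=1861):
    # Toy primes for illustration only; real deployments use >=2048-bit moduli.
    n = p * q
    n2 = n * n
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    # mu = (L(g^lam mod n^2))^{-1} mod n, where L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (n, n2, g), (lam, mu, n, n2)

def encrypt(pk, m):
    n, n2, g = pk
    while True:
        r = random.randrange(1, n)      # random blinding factor coprime to n
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(sk, c):
    lam, mu, n, n2 = sk
    return (pow(c, lam, n2) - 1) // n * mu % n

def enc_dot(pk, enc_x, weights):
    # Homomorphic dot product with plaintext weights:
    # prod_i E(x_i)^w_i = E(sum_i w_i * x_i), so the server computes an
    # encryption of the classifier logit without seeing the embedding.
    n, n2, g = pk
    acc = 1
    for c, w in zip(enc_x, weights):
        acc = acc * pow(c, w, n2) % n2
    return acc
```

The server sees only the ciphertexts and the public key, yet `enc_dot` yields an encryption of the logit w·x that only the key holder can decrypt; CKKS generalizes this to approximate real-valued arithmetic, including the matrix products needed for a classification head over BERT embeddings.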
Related papers
- Subword Embedding from Bytes Gains Privacy without Sacrificing Accuracy and Complexity [5.7601856226895665]
We propose Subword Embedding from Bytes (SEB), which encodes subwords as byte sequences using deep neural networks.
Our solution outperforms conventional approaches by preserving privacy without sacrificing efficiency or accuracy.
We verify that SEB obtains comparable or even better results than standard subword embedding methods in machine translation, sentiment analysis, and language modeling.
arXiv Detail & Related papers (2024-10-21T18:25:24Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten Text [3.3916160303055567]
We propose a simple post-processing method based on the goal of aligning rewritten texts with their original counterparts.
Our results show that such an approach not only produces outputs that are more semantically reminiscent of the original inputs, but also texts that on average score better in empirical privacy evaluations.
arXiv Detail & Related papers (2024-05-30T08:41:33Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are based either on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts.
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
- Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models [63.91178922306669]
We introduce Silent Guardian (SG), a text protection mechanism against large language models (LLMs).
By carefully modifying the text to be protected, TPE can induce LLMs to first sample the end token, thus directly terminating the interaction.
We show that SG can effectively protect the target text under various configurations, achieving a protection success rate of nearly 100% in some cases.
arXiv Detail & Related papers (2023-12-15T10:30:36Z)
- Recoverable Privacy-Preserving Image Classification through Noise-like Adversarial Examples [26.026171363346975]
Cloud-based image services such as classification have become crucial.
In this study, we propose a novel privacy-preserving image classification scheme.
Encrypted images can be decrypted back into their original form with high fidelity (recoverable) using a secret key.
arXiv Detail & Related papers (2023-10-19T13:01:58Z)
- SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation [72.10931780019297]
Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design.
We propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH).
Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on both common and bigram paraphrase attacks, but also is better at preserving the quality of generation.
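SemStamp's core primitive — partitioning a sentence-embedding space with locality-sensitive hashing so that paraphrases land in the same region — can be illustrated with the classic random-hyperplane signature. The sketch below is an assumed simplification; the actual SemStamp algorithm, including its rejection sampling over valid signature regions, is more involved.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    # n_bits random hyperplanes through the origin; a fixed seed lets the
    # watermarker and the detector derive the same partition of the space.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, hyperplanes):
    # One bit per hyperplane: which side of it the embedding falls on.
    # Nearby embeddings (e.g. paraphrases) agree on most bits, which is
    # what makes a sentence-level watermark robust to paraphrase attacks.
    return tuple(int(sum(v * h for v, h in zip(vec, hp)) >= 0)
                 for hp in hyperplanes)
```

At generation time, such a watermark would keep only candidate sentences whose embedding signature falls in a secret "valid" subset; the detector recomputes signatures and tests whether valid regions occur far more often than chance.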
arXiv Detail & Related papers (2023-10-06T03:33:42Z)
- General Framework for Reversible Data Hiding in Texts Based on Masked Language Modeling [15.136429369639686]
We propose a general framework to embed secret information into a given cover text.
The embedded information and the original cover text can be perfectly retrieved from the marked text.
Our results show that the original cover text and the secret information can be successfully embedded and extracted.
arXiv Detail & Related papers (2022-06-21T05:02:49Z)
- Autoregressive Linguistic Steganography Based on BERT and Consistency Coding [17.881686153284267]
Linguistic steganography (LS) conceals the presence of communication by embedding secret information into a text.
Recent algorithms use a language model (LM) to generate the steganographic text, which provides a higher payload than many earlier approaches.
We propose a novel autoregressive LS algorithm based on BERT and consistency coding, which achieves a better trade-off between embedding payload and system security.
arXiv Detail & Related papers (2022-03-26T02:36:55Z)
- Semantics-Preserved Distortion for Personal Privacy Protection in Information Management [65.08939490413037]
This paper suggests a linguistically-grounded approach to distort texts while maintaining semantic integrity.
We present two distinct frameworks for semantics-preserving distortion: a generative approach and a substitutive approach.
We also explore privacy protection in a specific medical information management scenario, showing our method effectively limits sensitive data memorization.
arXiv Detail & Related papers (2022-01-04T04:01:05Z)
- Reinforcement Learning on Encrypted Data [58.39270571778521]
We present a preliminary, experimental study of how a DQN agent trained on encrypted states performs in environments with discrete and continuous state spaces.
Our results highlight that the agent can still learn in small state spaces even in the presence of non-deterministic encryption, but performance collapses in more complex environments.
arXiv Detail & Related papers (2021-09-16T21:59:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.