Memorization for Good: Encryption with Autoregressive Language Models
- URL: http://arxiv.org/abs/2305.10445v2
- Date: Fri, 13 Oct 2023 18:25:11 GMT
- Title: Memorization for Good: Encryption with Autoregressive Language Models
- Authors: Samuel Stevens and Yu Su
- Abstract summary: We propose the first symmetric encryption algorithm with autoregressive language models (SELM).
We show that autoregressive LMs can encode arbitrary data into a compact real-valued vector (i.e., encryption) and then losslessly decode the vector to the original message (i.e., decryption) via random subspace optimization and greedy decoding.
- Score: 8.645826579841692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over-parameterized neural language models (LMs) can memorize and recite long
sequences of training data. While such memorization is normally associated with
undesired properties such as overfitting and information leaking, our work
casts memorization as an unexplored capability of LMs. We propose the first
symmetric encryption algorithm with autoregressive language models (SELM). We
show that autoregressive LMs can encode arbitrary data into a compact
real-valued vector (i.e., encryption) and then losslessly decode the vector to
the original message (i.e., decryption) via random subspace optimization and
greedy decoding. While SELM is not amenable to conventional cryptanalysis, we
investigate its security through a novel empirical variant of the classic
IND-CPA (indistinguishability under chosen-plaintext attack) game and show
promising results on security. Our code and datasets are available at
https://github.com/OSU-NLP-Group/SELM.
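The encrypt/decrypt loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "language model" here is a hypothetical lookup table of next-token logits indexed by position and previous token, the message is tokenized into 4-bit nibbles, the secret key is assumed to be the seed of the random projection, and the message length is assumed to be sent alongside the ciphertext. The names `THETA0`, `projection`, `encrypt`, and `decrypt` are invented for illustration. Only the low-dimensional vector `z` is optimized, and the weights are always `THETA0 + P @ z`.

```python
import numpy as np

V = 16          # toy vocabulary: one token per 4-bit nibble (assumption)
MAX_TOKENS = 8  # toy limit: messages of at most 4 bytes
START = V       # hypothetical "previous token" index used at step 0
D_SUB = 512     # subspace dimension, i.e. the ciphertext length

# Hypothetical public "pretrained" LM: a table of next-token logits indexed by
# (position, previous token). A real autoregressive LM would be a transformer.
THETA0 = np.random.default_rng(0).normal(0.0, 0.1, (MAX_TOKENS, V + 1, V))


def projection(key: int) -> np.ndarray:
    """Random map from the subspace to all weights, seeded by the secret key."""
    rng = np.random.default_rng(key)
    return rng.normal(0.0, 1.0 / np.sqrt(D_SUB), (THETA0.size, D_SUB))


def to_tokens(message: bytes) -> list[int]:
    """Split every byte into its high and low nibble."""
    return [n for b in message for n in (b >> 4, b & 0x0F)]


def logit_rows(t: int, prev: int) -> np.ndarray:
    """Flat indices of the V weights that score the token at step t."""
    return (t * (V + 1) + prev) * V + np.arange(V)


def encrypt(message: bytes, key: int, lr: float = 0.5,
            max_steps: int = 5000, min_conf: float = 0.9) -> np.ndarray:
    """Gradient descent on z until the model memorizes the message."""
    tokens = to_tokens(message)
    if not 0 < len(tokens) <= MAX_TOKENS:
        raise ValueError("toy model handles 1 to 4 bytes")
    P = projection(key)
    prevs = [START] + tokens[:-1]                 # teacher forcing
    rows = np.concatenate([logit_rows(t, p) for t, p in enumerate(prevs)])
    M, base = P[rows], THETA0.ravel()[rows]       # logits = base + M @ z
    onehot = np.eye(V)[tokens]
    z = np.zeros(D_SUB)
    for _ in range(max_steps):
        logits = (base + M @ z).reshape(-1, V)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # Stop once every target token is predicted with high confidence.
        if probs[np.arange(len(tokens)), tokens].min() >= min_conf:
            return z
        # Cross-entropy gradient with respect to z.
        z -= lr * (M.T @ (probs - onehot).ravel())
    raise RuntimeError("optimization did not converge")


def decrypt(z: np.ndarray, key: int, n_bytes: int) -> bytes:
    """Rebuild the weights from (key, z) and decode greedily."""
    P, flat = projection(key), THETA0.ravel()
    prev, tokens = START, []
    for t in range(2 * n_bytes):
        rows = logit_rows(t, prev)
        prev = int(np.argmax(flat[rows] + P[rows] @ z))
        tokens.append(prev)
    return bytes(16 * hi + lo for hi, lo in zip(tokens[::2], tokens[1::2]))
```

For example, `z = encrypt(b"SELM", key=1234)` followed by `decrypt(z, key=1234, n_bytes=4)` should round-trip the message, while a different key yields an unrelated projection and, almost surely, unrelated bytes. The toy is deliberately over-sized (512 floats for 4 bytes) so that plain gradient descent converges reliably; it shows the mechanics only and says nothing about the compactness or security reported for SELM.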
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- Rethinking LLM Memorization through the Lens of Adversarial Compression [93.13830893086681]
Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage.
One major question is whether these models "memorize" all their training data or they integrate many data sources in some way more akin to how a human would learn and synthesize information.
We propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs.
arXiv Detail & Related papers (2024-04-23T15:49:37Z)
- Robust Representation Learning for Privacy-Preserving Machine Learning: A Multi-Objective Autoencoder Approach [0.9831489366502302]
We propose a robust representation learning framework for privacy-preserving machine learning (ppML).
Our method centers on training autoencoders in a multi-objective manner and then concatenating the latent and learned features from the encoding part as the encoded form of our data.
With our proposed framework, we can share our data and use third party tools without being under the threat of revealing its original form.
arXiv Detail & Related papers (2023-09-08T16:41:25Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
- PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels [59.66777287810985]
We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
arXiv Detail & Related papers (2023-03-31T18:03:53Z)
- Machine Learning-Aided Efficient Decoding of Reed-Muller Subcodes [59.55193427277134]
Reed-Muller (RM) codes achieve the capacity of general binary-input memoryless symmetric channels.
RM codes only admit limited sets of rates.
Efficient decoders are available for RM codes at finite lengths.
arXiv Detail & Related papers (2023-01-16T04:11:14Z)
- Why do Nearest Neighbor Language Models Work? [93.71050438413121]
Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context.
Retrieval-augmented LMs have been shown to improve over standard neural LMs by accessing information retrieved from a large datastore.
arXiv Detail & Related papers (2023-01-07T11:12:36Z)
- Effect of Homomorphic Encryption on the Performance of Training Federated Learning Generative Adversarial Networks [10.030986278376567]
A Generative Adversarial Network (GAN) is a deep-learning generative model in the field of Machine Learning (ML).
In certain fields, such as medicine, the training data may be hospital patient records that are stored across different hospitals.
This paper will focus on the performance loss of training an FL-GAN with three different types of Homomorphic Encryption.
arXiv Detail & Related papers (2022-07-01T08:35:10Z)
- On the Importance of Encrypting Deep Features [15.340540198612823]
We analyze model inversion attacks with only two assumptions: feature vectors of user data are known, and a black-box API for inference is provided.
Experiments have been conducted on state-of-the-art models in person re-identification, and two attack scenarios (i.e., recognizing auxiliary attributes and reconstructing user data) are investigated.
Results show that an adversary could successfully infer sensitive information even under severe constraints.
arXiv Detail & Related papers (2021-08-16T15:22:33Z)
- Cryptotree: fast and accurate predictions on encrypted structured data [0.0]
Homomorphic Encryption (HE) is acknowledged for its ability to allow computation on encrypted data, where both the input and output are encrypted.
We propose Cryptotree, a framework that enables the use of Random Forests (RF), a very powerful learning procedure compared to linear regression.
arXiv Detail & Related papers (2020-06-15T11:48:01Z)