Memorization for Good: Encryption with Autoregressive Language Models
- URL: http://arxiv.org/abs/2305.10445v2
- Date: Fri, 13 Oct 2023 18:25:11 GMT
- Title: Memorization for Good: Encryption with Autoregressive Language Models
- Authors: Samuel Stevens and Yu Su
- Abstract summary: We propose the first symmetric encryption algorithm with autoregressive language models (SELM)
We show that autoregressive LMs can encode arbitrary data into a compact real-valued vector (i.e., encryption) and then losslessly decode the vector to the original message (i.e. decryption) via random subspace optimization and greedy decoding.
- Score: 8.645826579841692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over-parameterized neural language models (LMs) can memorize and recite long
sequences of training data. While such memorization is normally associated with
undesired properties such as overfitting and information leaking, our work
casts memorization as an unexplored capability of LMs. We propose the first
symmetric encryption algorithm with autoregressive language models (SELM). We
show that autoregressive LMs can encode arbitrary data into a compact
real-valued vector (i.e., encryption) and then losslessly decode the vector to
the original message (i.e., decryption) via random subspace optimization and
greedy decoding. While SELM is not amenable to conventional cryptanalysis, we
investigate its security through a novel empirical variant of the classic
IND-CPA (indistinguishability under chosen-plaintext attack) game and show
promising results on security. Our code and datasets are available at
https://github.com/OSU-NLP-Group/SELM.
Related papers
- Cryptanalysis via Machine Learning Based Information Theoretic Metrics [58.96805474751668]
We propose two novel applications of machine learning (ML) algorithms to perform cryptanalysis on any cryptosystem.
These algorithms can be readily applied in an audit setting to evaluate the robustness of a cryptosystem.
We show that our classification model correctly identifies the encryption schemes that are not IND-CPA secure, such as DES, RSA, and AES ECB, with high accuracy.
arXiv Detail & Related papers (2025-01-25T04:53:36Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - CodeCipher: Learning to Obfuscate Source Code Against LLMs [5.872773591957006]
We propose CodeCipher, a novel method that perturbs privacy from code while preserving the original response from LLMs.
CodeCipher transforms the LLM's embedding matrix so that each row corresponds to a different word in the original matrix, forming a token-to-token confusion mapping for obfuscating source code.
Results show that our model successfully confuses the privacy in source code while preserving the original LLM's performance.
arXiv Detail & Related papers (2024-10-08T08:28:54Z) - Unlocking Memorization in Large Language Models with Dynamic Soft Prompting [66.54460367290146]
Large language models (LLMs) have revolutionized natural language processing (NLP) tasks such as summarization, question answering, and translation.
LLMs pose significant security risks due to their tendency to memorize training data, leading to potential privacy breaches and copyright infringement.
We propose a novel method for estimating LLM memorization using dynamic, prefix-dependent soft prompts.
arXiv Detail & Related papers (2024-09-20T18:56:32Z) - Rethinking LLM Memorization through the Lens of Adversarial Compression [93.13830893086681]
Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage.
One major question is whether these models "memorize" all their training data or they integrate many data sources in some way more akin to how a human would learn and synthesize information.
We propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs.
arXiv Detail & Related papers (2024-04-23T15:49:37Z) - Robust Representation Learning for Privacy-Preserving Machine Learning:
A Multi-Objective Autoencoder Approach [0.9831489366502302]
We propose a robust representation learning framework for privacy-preserving machine learning (ppML)
Our method centers on training autoencoders in a multi-objective manner and then concatenating the latent and learned features from the encoding part as the encoded form of our data.
With our proposed framework, we can share our data and use third party tools without being under the threat of revealing its original form.
arXiv Detail & Related papers (2023-09-08T16:41:25Z) - In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z) - Machine Learning-Aided Efficient Decoding of Reed-Muller Subcodes [59.55193427277134]
Reed-Muller (RM) codes achieve the capacity of general binary-input memoryless symmetric channels.
RM codes only admit limited sets of rates.
Efficient decoders are available for RM codes at finite lengths.
arXiv Detail & Related papers (2023-01-16T04:11:14Z) - Effect of Homomorphic Encryption on the Performance of Training
Federated Learning Generative Adversarial Networks [10.030986278376567]
A Generative Adversarial Network (GAN) is a deep-learning generative model in the field of Machine Learning (ML)
In certain fields, such as medicine, the training data may be hospital patient records that are stored across different hospitals.
This paper will focus on the performance loss of training an FL-GAN with three different types of Homomorphic Encryption.
arXiv Detail & Related papers (2022-07-01T08:35:10Z) - Cryptotree: fast and accurate predictions on encrypted structured data [0.0]
Homomorphic Encryption (HE) is acknowledged for its ability to allow computation on encrypted data, where both the input and output are encrypted.
We propose Cryptotree, a framework that enables the use of Random Forests (RF), a very powerful learning procedure compared to linear regression.
arXiv Detail & Related papers (2020-06-15T11:48:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.