Traces of Memorisation in Large Language Models for Code
- URL: http://arxiv.org/abs/2312.11658v2
- Date: Mon, 15 Jan 2024 21:24:41 GMT
- Title: Traces of Memorisation in Large Language Models for Code
- Authors: Ali Al-Kaswan and Maliheh Izadi and Arie van Deursen
- Abstract summary: Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet.
We compare their rate of memorisation to that of large language models trained on natural language.
We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have gained significant popularity because of their
ability to generate human-like text and potential applications in various
fields, such as Software Engineering. Large language models for code are
commonly trained on large unsanitised corpora of source code scraped from the
internet. The content of these datasets is memorised and can be extracted by
attackers with data extraction attacks. In this work, we explore memorisation
in large language models for code and compare the rate of memorisation with
large language models trained on natural language. We adopt an existing
benchmark for natural language and construct a benchmark for code by
identifying samples that are vulnerable to attack. We run both benchmarks
against a variety of models, and perform a data extraction attack. We find that
large language models for code are vulnerable to data extraction attacks, like
their natural language counterparts. Of the training data identified as
potentially extractable, we were able to extract 47% from a CodeGen-Mono-16B
code completion model. We also observe that models memorise more as their
parameter count grows, and that their pre-training data are also vulnerable to
attack. We also find that data carriers are memorised at a higher rate than
regular code or documentation, and that different model architectures memorise
different samples. Data leakage has severe consequences, so we urge the
research community to further investigate the extent of this phenomenon using a
wider range of models and extraction techniques in order to build safeguards to
mitigate this issue.
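The extraction attack described above can be illustrated with a short probe: prompt a code completion model with the prefix of a potentially extractable sample and check whether the greedy continuation reproduces the true suffix verbatim. Below is a minimal sketch assuming the HuggingFace transformers API and the small Salesforce/codegen-350M-mono checkpoint as a stand-in for CodeGen-Mono-16B; the 50-token prefix/suffix split and the exact-match criterion are illustrative assumptions, not necessarily the paper's exact protocol.

```python
# Minimal extraction probe (sketch). Assumptions: HuggingFace transformers,
# a small CodeGen-Mono checkpoint standing in for CodeGen-Mono-16B, a
# 50-token prefix/suffix split, and verbatim token match as the
# memorisation criterion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def is_memorised(sample: str, prefix_len: int = 50, suffix_len: int = 50) -> bool:
    """Prompt the model with a prefix of `sample` and check whether the
    greedy continuation reproduces the true suffix token-for-token."""
    ids = tokenizer(sample, return_tensors="pt").input_ids[0]
    if ids.size(0) < prefix_len + suffix_len:
        return False  # sample too short to split into prefix and suffix
    prefix = ids[:prefix_len].unsqueeze(0)
    target = ids[prefix_len:prefix_len + suffix_len]
    with torch.no_grad():
        out = model.generate(
            prefix,
            max_new_tokens=suffix_len,
            do_sample=False,                      # greedy decoding
            pad_token_id=tokenizer.eos_token_id,  # CodeGen has no pad token
        )
    generated = out[0, prefix_len:prefix_len + suffix_len]
    return generated.size(0) == target.size(0) and bool((generated == target).all())
```

Running such a probe over every benchmark sample and dividing the number of verbatim-reproduced suffixes by the number of candidates yields the kind of extraction rate the abstract reports (47% of potentially extractable samples for CodeGen-Mono-16B).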
Related papers
- Scalable Extraction of Training Data from (Production) Language Models [93.7746567808049]
This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset.
We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT.
arXiv Detail & Related papers (2023-11-28T18:47:03Z)
- Language Models are Universal Embedders [48.12992614723464]
We show that pre-trained transformer decoders can embed universally when finetuned on limited English data.
Our models achieve competitive performance on different embedding tasks with minimal training data.
These results provide evidence of a promising path towards building powerful unified embedders.
arXiv Detail & Related papers (2023-10-12T11:25:46Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Paraphrastic Representations at Scale [134.41025103489224]
We release trained models for English, Arabic, German, French, Spanish, Russian, Turkish, and Chinese languages.
We train these models on large amounts of data, achieving significantly improved performance over the original papers.
arXiv Detail & Related papers (2021-04-30T16:55:28Z)
- Extracting Training Data from Large Language Models [78.3839333127544]
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.
arXiv Detail & Related papers (2020-12-14T18:39:09Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero initial training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches [0.0]
This paper compares different deep learning architectures to create and use language models based on programming code.
We discuss each approach's strengths and weaknesses, and the gaps we find in evaluating the language models or applying them in a real programming context.
arXiv Detail & Related papers (2020-09-16T15:17:04Z)