Extracting books from production language models
- URL: http://arxiv.org/abs/2601.02671v1
- Date: Tue, 06 Jan 2026 03:01:27 GMT
- Title: Extracting books from production language models
- Authors: Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang,
- Abstract summary: It remains an open question if similar extraction is feasible for production LLMs.<n>In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim.<n>Even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
- Score: 65.85348210518937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g, nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
Related papers
- Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs [56.47577824219207]
We present a prompt-based strategy that compels an off-the-shelf large language model to generate exactly a desired number of tokens.<n>The prompt appends countdown markers and explicit counting rules so that the model "writes while counting"<n>On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt.
arXiv Detail & Related papers (2025-08-19T13:12:01Z) - Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! [77.5835471257498]
Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers.<n>We reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data.
arXiv Detail & Related papers (2025-05-21T15:32:14Z) - Extracting memorized pieces of (copyrighted) books from open-weight language models [63.06081428612624]
We show that plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression.<n>We show that these polarized positions dramatically oversimplify the relationship between memorization and copyright.<n>We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
arXiv Detail & Related papers (2025-05-18T21:06:32Z) - Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes.<n>Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z) - Extracting Memorized Training Data via Decomposition [24.198975804570072]
We demonstrate a simple, query-based decompositional method to extract news articles from two frontier Large Language Models.
We extract at least one sentence from 73 articles, and over 20% of verbatim sentences from 6 articles.
If replicable at scale, this training data extraction methodology could expose new LLM security and safety vulnerabilities.
arXiv Detail & Related papers (2024-09-18T23:59:32Z) - PlagBench: Exploring the Duality of Large Language Models in Plagiarism Generation and Detection [26.191836276118696]
We introduce textbfsf PlagBench, a dataset of 46.5K synthetic text pairs that represent three major types of plagiarism.<n>PlagBench is validated through a combination of fine-grained automatic evaluation and human annotation.<n>We show GPT-3.5 Turbo can produce high-quality paraphrases and summaries without significantly increasing text complexity compared to GPT-4 Turbo.
arXiv Detail & Related papers (2024-06-24T03:29:53Z) - PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition [10.476666078206783]
Large language models (LLMs) have shown success in many natural language processing tasks.
Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks.
We propose PARDEN, which avoids the domain shift by simply asking the model to repeat its own outputs.
arXiv Detail & Related papers (2024-05-13T17:08:42Z) - Make Them Spill the Beans! Coercive Knowledge Extraction from
(Production) LLMs [31.80386572346993]
We exploit the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits.
This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster.
Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.
arXiv Detail & Related papers (2023-12-08T01:41:36Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z) - Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation in private libraries.
We propose a novel framework that emulates the process of programmers writing private code.
We create four private library benchmarks, including TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z) - Semantic Compression With Large Language Models [1.0874100424278175]
Large language models (LLMs) are revolutionizing information retrieval, question answering, summarization, and code generation tasks.
LLMs are inherently limited by the number of input and output tokens that can be processed at once.
This paper presents three contributions to research on LLMs.
arXiv Detail & Related papers (2023-04-25T01:47:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.