Fugu-MT Paper Translation (Summary): Extracting books from production language models

Paper summary: Extracting books from production language models

arxiv url: http://arxiv.org/abs/2601.02671v1
Date: Tue, 06 Jan 2026 03:01:27 GMT
Status: Translation complete
Last updated (in system): 2026-01-07 17:02:12.781662
Title: Extracting books from production language models
Authors: Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, Percy Liang
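Abstract summary: Whether similar extraction is feasible for production LLMs remains an open question. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim. Even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.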
Reference score (in-house attention score): 65.85348210518937
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question whether similar extraction is feasible for production LLMs, given the safety measures these systems implement. We investigate this question using a two-phase procedure: (1) an initial probe to test for extraction feasibility, which sometimes uses a Best-of-N (BoN) jailbreak, followed by (2) iterative continuation prompts to attempt to extract the book. We evaluate our procedure on four production LLMs -- Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 -- and we measure extraction success with a score computed from a block-based approximation of longest common substring (nv-recall). With different per-LLM experimental configurations, we were able to extract varying amounts of text. For the Phase 1 probe, it was unnecessary to jailbreak Gemini 2.5 Pro and Grok 3 to extract text (e.g., nv-recall of 76.8% and 70.3%, respectively, for Harry Potter and the Sorcerer's Stone), while it was necessary for Claude 3.7 Sonnet and GPT-4.1. In some cases, jailbroken Claude 3.7 Sonnet outputs entire books near-verbatim (e.g., nv-recall=95.8%). GPT-4.1 requires significantly more BoN attempts (e.g., 20X), and eventually refuses to continue (e.g., nv-recall=4.0%). Taken together, our work highlights that, even with model- and system-level safeguards, extraction of (in-copyright) training data remains a risk for production LLMs.
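The abstract describes nv-recall only as a score computed from a block-based approximation of longest common substring; the exact definition is in the paper and is not reproduced here. As a rough illustration of the idea, the sketch below assumes (none of this comes from the paper) whitespace tokenization, Python's difflib matching blocks as the "blocks", and a hypothetical minimum block length `min_block` that filters out short coincidental matches. The function name `nv_recall_sketch` is also hypothetical; it returns the fraction of reference words covered by sufficiently long verbatim runs in the generated text.

```python
from difflib import SequenceMatcher


def nv_recall_sketch(reference: str, generated: str, min_block: int = 3) -> float:
    """Hypothetical near-verbatim recall (NOT the paper's exact nv-recall).

    Assumptions: whitespace tokenization, difflib matching blocks as the
    block-based approximation, and only blocks of at least `min_block`
    words counted, so short coincidental overlaps do not add to the score.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0
    # autojunk=False keeps frequent words from being discarded on long texts.
    matcher = SequenceMatcher(a=ref_words, b=gen_words, autojunk=False)
    # get_matching_blocks() yields non-overlapping, order-preserving matches;
    # the final dummy block has size 0 and is excluded by the threshold.
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_block
    )
    return matched / len(ref_words)


if __name__ == "__main__":
    # The first four reference words are reproduced verbatim, the rest are not.
    print(nv_recall_sketch("a b c d e f g h", "a b c d x y z"))
```

Running the example prints 0.5: one block of four words matches out of eight reference words. Because difflib's matching blocks preserve order, passages the model reproduces out of sequence are not fully credited, and raising `min_block` trades sensitivity for robustness against chance overlaps; both are design choices of this sketch, not claims about the authors' metric.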


Related papers