SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
- URL: http://arxiv.org/abs/2308.04430v1
- Date: Tue, 8 Aug 2023 17:58:15 GMT
- Title: SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
- Authors: Sewon Min, Suchin Gururangan, Eric Wallace, Hannaneh Hajishirzi, Noah
A. Smith, Luke Zettlemoyer
- Abstract summary: We present SILO, a new language model that manages this risk-performance tradeoff during inference.
SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text.
Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
- Score: 125.06066299987106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The legality of training language models (LMs) on copyrighted or otherwise
restricted data is under intense debate. However, as we show, model performance
significantly degrades if trained only on low-risk text (e.g., out-of-copyright
books or government documents), due to its limited size and domain coverage. We
present SILO, a new language model that manages this risk-performance tradeoff
during inference. SILO is built by (1) training a parametric LM on Open License
Corpus (OLC), a new corpus we curate with 228B tokens of public domain and
permissively licensed text and (2) augmenting it with a more general and easily
modifiable nonparametric datastore (e.g., containing copyrighted books or news)
that is only queried during inference. The datastore allows use of high-risk
data without training on it, supports sentence-level data attribution, and
enables data producers to opt out from the model by removing content from the
store. These capabilities can foster compliance with data-use regulations such
as the fair use doctrine in the United States and the GDPR in the European
Union. Our experiments show that the parametric LM struggles on domains not
covered by OLC. However, access to the datastore greatly improves out of domain
performance, closing 90% of the performance gap with an LM trained on the Pile,
a more diverse corpus with mostly high-risk text. We also analyze which
nonparametric approach works best, where the remaining errors lie, and how
performance scales with datastore size. Our results suggest that it is possible
to build high quality language models while mitigating their legal risk.
Related papers
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z) - Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - LLM-Generated Natural Language Meets Scaling Laws: New Explorations and Data Augmentation Methods [3.333401582174629]
We introduce the scaling laws to intrinsically calculate large language models (LLMNL) and human natural language (HNL)
Through experiments, we reveal slight deviations from Mandelbrot's law in LLMNL, underscore a complexity advantage in HNL, and supplement an interpretive discussion on language style.
We introduce a novel data augmentation method for few-shot text classification, termed ZGPTDA, which leverages fuzzy computing mechanisms driven by the conformity to scaling laws.
arXiv Detail & Related papers (2024-06-29T05:40:17Z) - Evaluating Copyright Takedown Methods for Language Models [100.38129820325497]
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material.
This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs.
We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches.
arXiv Detail & Related papers (2024-06-26T18:09:46Z) - Rethinking LLM Memorization through the Lens of Adversarial Compression [93.13830893086681]
Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage.
One major question is whether these models "memorize" all their training data or they integrate many data sources in some way more akin to how a human would learn and synthesize information.
We propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs.
arXiv Detail & Related papers (2024-04-23T15:49:37Z) - FedJudge: Federated Legal Large Language Model [10.70953602515144]
Large Language Models (LLMs) have gained prominence in the field of Legal Intelligence, offering potential applications in assisting legal professionals and laymen.
This paper explores the integration of Legal LLMs with Federated Learning (FL) methodologies.
We propose the first Federated Legal Large Language Model (FedJudge) framework, which fine-tunes Legal LLMs efficiently and effectively.
arXiv Detail & Related papers (2023-09-15T05:45:44Z) - Recovering from Privacy-Preserving Masking with Large Language Models [14.828717714653779]
We use large language models (LLMs) to suggest substitutes of masked tokens.
We show that models trained on the obfuscation corpora are able to achieve comparable performance with the ones trained on the original data.
arXiv Detail & Related papers (2023-09-12T16:39:41Z) - Are You Copying My Model? Protecting the Copyright of Large Language
Models for EaaS via Backdoor Watermark [58.60940048748815]
Companies have begun to offer Embedding as a Service (E) based on large language models (LLMs)
E is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs.
We propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings.
arXiv Detail & Related papers (2023-05-17T08:28:54Z) - Memorization for Good: Encryption with Autoregressive Language Models [8.645826579841692]
We propose the first symmetric encryption algorithm with autoregressive language models (SELM)
We show that autoregressive LMs can encode arbitrary data into a compact real-valued vector (i.e., encryption) and then losslessly decode the vector to the original message (i.e. decryption) via random subspace optimization and greedy decoding.
arXiv Detail & Related papers (2023-05-15T05:42:34Z) - CUE Vectors: Modular Training of Language Models Conditioned on Diverse
Contextual Signals [11.310756148007753]
We propose a framework to modularize the training of neural language models that use diverse forms of sentence-external context (including metadata)
Our approach, contextual universal embeddings (CUE), trains LMs on one set of context, such as date and author, and adapts to novel metadata types, such as article title, or previous sentence.
We validate the CUE framework on a NYTimes text corpus with multiple metadata types, for which the LM perplexity can be lowered from 36.6 to 27.4 by conditioning on context.
arXiv Detail & Related papers (2022-03-16T17:37:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.