Related papers: SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore

URL: http://arxiv.org/abs/2308.04430v2
Date: Wed, 31 Jul 2024 02:15:31 GMT
Title: SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
Authors: Sewon Min, Suchin Gururangan, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer,
Abstract summary: We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text. Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
Score: 159.21914121143885
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The legality of training language models (LMs) on copyrighted or otherwise restricted data is under intense debate. However, as we show, model performance significantly degrades if trained only on low-risk text (e.g., out-of-copyright books or government documents), due to its limited size and domain coverage. We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text and (2) augmenting it with a more general and easily modifiable nonparametric datastore (e.g., containing copyrighted books or news) that is only queried during inference. The datastore allows use of high-risk data without training on it, supports sentence-level data attribution, and enables data producers to opt out from the model by removing content from the store. These capabilities can foster compliance with data-use regulations such as the fair use doctrine in the United States and the GDPR in the European Union. Our experiments show that the parametric LM struggles on domains not covered by OLC. However, access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile, a more diverse corpus with mostly high-risk text. We also analyze which nonparametric approach works best, where the remaining errors lie, and how performance scales with datastore size. Our results suggest that it is possible to build high quality language models while mitigating their legal risk.

Related papers

Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs [42.58914814153536]
We quantify the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition. However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines.
arXiv Detail & Related papers (2025-04-08T17:08:06Z)
SUV: Scalable Large Language Model Copyright Compliance with Regularized Selective Unlearning [22.76025238218253]
SUV is a selective unlearning framework designed to prevent Large Language Models from memorizing copyrighted content. We replace verbatim copyrighted content with plausible and coherent alternatives. We validate our approach using a large-scale dataset of 500 famous books.
arXiv Detail & Related papers (2025-03-29T02:33:26Z)
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z)
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
Towards Operationalizing Right to Data Protection [8.61230665736263]
RegText is a framework that injects imperceptible correlations into natural language datasets effectively rendering them unlearnable without affecting content. We demonstrate RegText's utility through rigorous empirical analysis of small and large LMs. RegText can newer models like GPT-4o and Llama from learning on our generated data, resulting in a drop in their test accuracy compared to their zero-shot performance.
arXiv Detail & Related papers (2024-11-13T10:43:31Z)
TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text [5.523385345486362]
We have developed language models specifically designed for legal applications. Our innovative approach significantly improves capabilities in legal tasks by using Large Language Models (LLMs) to convert raw training data into reading comprehension text.
arXiv Detail & Related papers (2024-10-28T19:32:18Z)
Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates. We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
Strategies for Improving NL-to-FOL Translation with LLMs: Data Generation, Incremental Fine-Tuning, and Verification [9.36179617282876]
We create a high-quality FOL-annotated subset of ProofWriter dataset using GPT-4o. Our results show state-of-the-art performance for ProofWriter and ProntoQA datasets using ProofFOL on LLaMA-2 and Mistral models.
arXiv Detail & Related papers (2024-09-24T21:24:07Z)
Nob-MIAs: Non-biased Membership Inference Attacks Assessment on Large Language Models with Ex-Post Dataset Construction [37.69303106863453]
Membership Inference Attacks (MIAs) aim to detect whether specific documents were used in a given Large Language Models (LLMs) pretraining. This paper addresses the evaluation of MIAs on LLMs with partially inferable training sets. We propose and validate algorithms to create non-biased'' and non-classifiable'' datasets for fairer MIA assessment.
arXiv Detail & Related papers (2024-08-12T07:49:28Z)
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss. Based on the findings of the entropy law, we propose a quite efficient and universal data selection method. We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z)
Rethinking LLM Memorization through the Lens of Adversarial Compression [93.13830893086681]
Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage. One major question is whether these models "memorize" all their training data or they integrate many data sources in some way more akin to how a human would learn and synthesize information. We propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs.
arXiv Detail & Related papers (2024-04-23T15:49:37Z)
FedJudge: Federated Legal Large Language Model [10.70953602515144]
Large Language Models (LLMs) have gained prominence in the field of Legal Intelligence, offering potential applications in assisting legal professionals and laymen. This paper explores the integration of Legal LLMs with Federated Learning (FL) methodologies. We propose the first Federated Legal Large Language Model (FedJudge) framework, which fine-tunes Legal LLMs efficiently and effectively.
arXiv Detail & Related papers (2023-09-15T05:45:44Z)
Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark [58.60940048748815]
Companies have begun to offer Embedding as a Service (E) based on large language models (LLMs) E is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs. We propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings.
arXiv Detail & Related papers (2023-05-17T08:28:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.