Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
- URL: http://arxiv.org/abs/2406.13069v3
- Date: Fri, 04 Oct 2024 16:42:20 GMT
- Title: Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
- Authors: William Merrill, Noah A. Smith, Yanai Elazar
- Abstract summary: We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
- Score: 57.14250086701313
- Abstract: How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
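As a concrete illustration of the $n$-novelty metric defined above, the following Python sketch computes the proportion of generated $n$-grams that never appear in a training corpus using a naive hash-set index. The function names, toy whitespace tokenization, and example data are assumptions for exposition only; this is not the paper's Rusty-DAWG implementation, which answers arbitrary-length $n$-gram queries in constant time with respect to corpus size.

```python
def ngrams(tokens, n):
    """Yield all contiguous n-grams of a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def n_novelty(generated_tokens, training_tokens, n):
    """Proportion of n-grams in the generated text that never occur in training.

    Naive hash-set index for exposition only; Rusty-DAWG instead supports
    arbitrary-length n-gram lookup in constant time w.r.t. corpus size.
    """
    train_index = set(ngrams(training_tokens, n))
    generated = list(ngrams(generated_tokens, n))
    if not generated:
        return 0.0
    return sum(1 for g in generated if g not in train_index) / len(generated)

# Toy whitespace tokenization (an assumption, not the paper's tokenizer).
train = "the cat sat on the mat".split()
gen = "the cat sat on the rug".split()
print(n_novelty(gen, train, n=3))  # 1 of 4 trigrams is unseen -> 0.25
```

Note that a hash-set index has to be rebuilt for each $n$ and grows with the corpus, which is precisely the limitation a suffix-automaton-style structure such as a DAWG avoids.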
Related papers
- Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens [138.36729703589512]
We show that $n$-gram language models are still relevant in this era of neural large language models (LLMs).
We modernize $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens.
Second, existing $n$-gram LMs use small $n$, which hinders their performance; we instead allow $n$ to be arbitrarily large by introducing a new $\infty$-gram LM with backoff.
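To make the backoff idea concrete, here is a minimal sketch of an $\infty$-gram-style next-token estimate: back off to the longest context suffix found in the training data and use the empirical distribution of its continuations. The brute-force scan and helper names are illustrative assumptions; the actual infini-gram system relies on a suffix-array index rather than anything like this.

```python
from collections import Counter

def infinity_gram_next_token(context, training_tokens):
    """Back off to the longest suffix of `context` that occurs in the training
    data, then return the empirical distribution over its continuations.

    Brute-force scan for clarity; the real system uses a suffix-array index.
    """
    for start in range(len(context)):  # try the longest suffix first
        suffix = tuple(context[start:])
        counts = Counter(
            training_tokens[i + len(suffix)]
            for i in range(len(training_tokens) - len(suffix))
            if tuple(training_tokens[i:i + len(suffix)]) == suffix
        )
        if counts:
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}  # even the single-token suffix never appears in training

train = "a b c a b d a b c".split()
print(infinity_gram_next_token("x a b".split(), train))
# backs off from ('x', 'a', 'b') to ('a', 'b') -> {'c': 0.67, 'd': 0.33}
```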
arXiv Detail & Related papers (2024-01-30T19:03:49Z)
- TOPFORMER: Topology-Aware Authorship Attribution of Deepfake Texts with Diverse Writing Styles [14.205559299967423]
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are non-trivial to distinguish from human-written text.
Users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and dis/misinformation at scale.
To mitigate this problem, a computational method is needed to determine whether a given text is a deepfake.
We propose TopFormer to improve existing authorship attribution (AA) solutions by capturing more linguistic patterns in deepfake texts.
arXiv Detail & Related papers (2023-09-22T15:32:49Z)
- Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval [51.437420003471615]
We propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch.
RPT improves retrieval quality and, in turn, perplexity across the board compared to strong baselines.
arXiv Detail & Related papers (2023-06-23T10:18:02Z)
- Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study [115.96080028033904]
We study a scalable pre-trained retrieval-augmented LM (i.e., RETRO) compared with standard GPT and retrieval-augmented GPT.
Our findings highlight the promising direction of pretraining autoregressive LMs with retrieval as future foundation models.
arXiv Detail & Related papers (2023-04-13T18:04:19Z)
- Stealing the Decoding Algorithms of Language Models [56.369946232765656]
A key component of generating text from modern language models (LMs) is the selection and tuning of decoding algorithms.
In this work, we show, for the first time, that an adversary with typical API access to an LM can steal the type and hyperparameters of its decoding algorithms.
Our attack is effective against popular LMs used in text generation APIs, including GPT-2, GPT-3 and GPT-Neo.
arXiv Detail & Related papers (2023-03-08T17:15:58Z)
- Discovering Language Model Behaviors with Model-Written Evaluations [18.24267922379281]
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave.
Here, we automatically generate evaluations with LMs.
We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size.
arXiv Detail & Related papers (2022-12-19T05:13:52Z)
- You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM [65.74934004876914]
Retrieval-enhanced language models (LMs) condition their predictions on text retrieved from large external datastores.
One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model.
We empirically measure the effectiveness of our approach on two English language modeling datasets.
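For readers unfamiliar with the $k$NN-LM interpolation mentioned above, the sketch below follows the standard formulation: a distribution induced by retrieved neighbors (a softmax over negative distances, with mass aggregated by each neighbor's next token) is mixed with the base LM's distribution using a weight $\lambda$. The helper name, toy inputs, and default $\lambda$ are assumptions for illustration.

```python
import math
from collections import defaultdict

def knn_lm_interpolate(lm_probs, neighbors, lam=0.25, temperature=1.0):
    """Interpolate a base LM's next-token distribution with a kNN distribution.

    lm_probs:  dict mapping token -> probability under the base LM.
    neighbors: list of (distance, next_token) pairs retrieved from the datastore.
    lam:       weight on the kNN distribution (value assumed for illustration).
    """
    # kNN distribution: softmax over negative distances, mass aggregated by
    # each retrieved neighbor's next token.
    weights = [math.exp(-d / temperature) for d, _ in neighbors]
    total = sum(weights)
    knn_probs = defaultdict(float)
    for w, (_, tok) in zip(weights, neighbors):
        knn_probs[tok] += w / total

    vocab = set(lm_probs) | set(knn_probs)
    return {t: lam * knn_probs[t] + (1 - lam) * lm_probs.get(t, 0.0) for t in vocab}

print(knn_lm_interpolate({"cat": 0.6, "dog": 0.4},
                         [(0.1, "dog"), (0.5, "dog"), (0.9, "cat")]))
```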
arXiv Detail & Related papers (2022-10-28T02:57:40Z)
- Residual Learning of Neural Text Generation with $n$-gram Language Model [41.26228768053928]
We learn a neural LM that fits the residual between an $n$-gram LM and the real-data distribution.
Our approach consistently attains additional performance gains over popular standalone neural models.
arXiv Detail & Related papers (2022-10-26T02:42:53Z)
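The residual formulation in the entry above can be sketched, under assumptions, as adding neural "residual" logits to the $n$-gram LM's log-probabilities and renormalizing; the exact combination rule, names, and toy numbers below are illustrative guesses rather than the paper's formulation.

```python
import math

def combine_with_residual(ngram_probs, residual_logits):
    """Add neural residual logits to n-gram log-probabilities and renormalize.

    Both arguments are aligned lists over the same toy vocabulary; the
    combination rule here is an illustrative guess, not the paper's exact one.
    """
    combined = [math.log(max(p, 1e-12)) + r
                for p, r in zip(ngram_probs, residual_logits)]
    z = max(combined)                      # stabilize the softmax
    exps = [math.exp(c - z) for c in combined]
    total = sum(exps)
    return [e / total for e in exps]

# Three-token toy vocabulary.
print(combine_with_residual([0.5, 0.3, 0.2], [0.1, -0.2, 0.4]))
```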