Heaps' Law in GPT-Neo Large Language Model Emulated Corpora
- URL: http://arxiv.org/abs/2311.06377v1
- Date: Fri, 10 Nov 2023 20:07:32 GMT
- Title: Heaps' Law in GPT-Neo Large Language Model Emulated Corpora
- Authors: Uyen Lai, Gurjit S. Randhawa, Paul Sheridan
- Abstract summary: Heaps' law is an empirical relation in text analysis that predicts vocabulary growth as a function of corpus size.
This study focuses on the emulation of corpora using the suite of GPT-Neo large language models.
- Score: 2.7234916145234713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Heaps' law is an empirical relation in text analysis that predicts vocabulary
growth as a function of corpus size. While this law has been validated in
diverse human-authored text corpora, its applicability to large language model
generated text remains unexplored. This study addresses this gap, focusing on
the emulation of corpora using the suite of GPT-Neo large language models. To
conduct our investigation, we emulated corpora of PubMed abstracts using three
different parameter sizes of the GPT-Neo model. Our emulation strategy involved
using the initial five words of each PubMed abstract as a prompt and
instructing the model to expand the content up to the original abstract's
length. Our findings indicate that the generated corpora adhere to Heaps' law.
Interestingly, as the GPT-Neo model size grows, its generated vocabulary
increasingly adheres to Heaps' law as observed in human-authored text. To
further improve the richness and authenticity of GPT-Neo outputs, future
iterations could emphasize enhancing model size or refining the model
architecture to curtail vocabulary repetition.
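Heaps' law states that vocabulary size grows as a power of corpus size, V(n) = K·n^β with 0 < β < 1. As a minimal illustration (not the paper's code), the sketch below tracks vocabulary growth over a token stream and estimates K and β by a least-squares fit in log-log space, which is the standard way such exponents are measured:

```python
import math

def heaps_vocab_growth(tokens):
    """Return (corpus_size, vocab_size) pairs as the token stream is scanned."""
    seen = set()
    points = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        points.append((i, len(seen)))
    return points

def fit_heaps(points):
    """Fit log V = log K + beta * log n by ordinary least squares.

    Returns (K, beta) so that V(n) ~ K * n**beta.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    beta = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
            / sum((x - x_mean) ** 2 for x in xs))
    log_k = y_mean - beta * x_mean
    return math.exp(log_k), beta
```

Applied to a human-authored corpus and to a model-emulated one, comparing the fitted β values quantifies how closely the generated text's vocabulary growth matches the original, which is the kind of comparison the abstract describes.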
Related papers
- OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models [55.63479003621053]
We introduce OWLS, an open-access suite of multilingual speech recognition and translation models.
We use OWLS to derive neural scaling laws, showing how final performance can be reliably predicted when scaling.
We show how OWLS can be used to power new research directions by discovering emergent abilities in large-scale speech models.
arXiv Detail & Related papers (2025-02-14T18:51:40Z)
- Large corpora and large language models: a replicable method for automating grammatical annotation [0.0]
We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y'.
We reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data.
We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change.
arXiv Detail & Related papers (2024-11-18T03:29:48Z)
- JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models [53.83273575102087]
We propose an unsupervised inference-time approach to authorship obfuscation.
We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation.
Our approach builds on small language models such as GPT2-XL in order to help avoid disclosing the original content to proprietary LLMs' APIs.
arXiv Detail & Related papers (2024-02-13T19:54:29Z)
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- Do Large GPT Models Discover Moral Dimensions in Language Representations? A Topological Study Of Sentence Embeddings [0.7416846035207727]
We examine the topological structure of neuronal activity in the "brain" of ChatGPT's foundation language model, and analyze it with respect to a metric representing the notion of fairness.
We first compute a fairness metric, inspired by social literature, to identify factors that typically influence fairness assessments in humans, such as legitimacy, need, and responsibility.
Our results show that sentence embeddings based on GPT-3.5 can be decomposed into two submanifolds corresponding to fair and unfair moral judgments.
arXiv Detail & Related papers (2023-09-17T23:38:39Z)
- Galactic ChitChat: Using Large Language Models to Converse with Astronomy Literature [0.0]
We demonstrate the potential of the state-of-the-art OpenAI GPT-4 large language model to engage in meaningful interactions with Astronomy papers.
We employ a distillation technique that effectively reduces the size of the original input paper by 50%.
We then explore the model's responses using a multi-document context.
arXiv Detail & Related papers (2023-04-12T03:02:20Z)
- Retrieval augmentation of large language models for lay language generation [12.686922203465896]
We introduce CELLS, the largest (63k pairs) and broadest-ranging (12 journals) parallel corpus for lay language generation.
The abstract and the corresponding lay language summary are written by domain experts, assuring the quality of our dataset.
We derive two specialized paired corpora from CELLS to address key challenges in lay language generation: generating background explanations and simplifying the original abstract.
arXiv Detail & Related papers (2022-11-07T19:06:53Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Corpus-Based Paraphrase Detection Experiments and Review [0.0]
Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, etc.
In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, with the task of paraphrase detection.
arXiv Detail & Related papers (2021-05-31T23:29:24Z)
- GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation [9.501648136713694]
Large-scale language models such as GPT-3 are excellent few-shot learners, allowing them to be controlled via natural text prompts.
This paper proposes a novel data augmentation technique that leverages large-scale language models to generate realistic text samples.
arXiv Detail & Related papers (2021-04-18T11:39:33Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterward, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Progressive Generation of Long Text with Pretrained Language Models [83.62523163717448]
Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators.
It is still challenging for such models to generate coherent long passages of text, especially when the models are fine-tuned to the target domain on a small corpus.
We propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution.
arXiv Detail & Related papers (2020-06-28T21:23:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.