Go Forth and Prosper: Language Modeling with Ancient Textual History
- URL: http://arxiv.org/abs/2104.08742v1
- Date: Sun, 18 Apr 2021 06:57:30 GMT
- Title: Go Forth and Prosper: Language Modeling with Ancient Textual History
- Authors: Rik Koncel-Kedziorski and Noah A. Smith
- Abstract summary: We learn an auxiliary function to select spans from the ancient history which can help the LM to predict future text.
The selected text spans are then copied directly into the LM's context window, replacing less predictive spans.
We see a 7 percent perplexity reduction on Wikipedia articles, and a 12 percent perplexity reduction on scientific texts.
- Score: 54.99143450580711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a technique for improving document-level language models (LMs) by
leveraging "ancient history": text that is outside the LM's current context
window. We learn an auxiliary function to select spans from the ancient history
which can help the LM to predict future text. The selected text spans are then
copied directly into the LM's context window, replacing less predictive spans.
This method can improve perplexity of pretrained LMs with no updates to the
LM's own parameters. We further observe that an auxiliary function trained in a
specific textual domain like Wikipedia will also work in a substantially
different domain such as scientific publications. With this technique we see a
7 percent perplexity reduction on Wikipedia articles, and a 12 percent
perplexity reduction on scientific texts.
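The abstract leaves the mechanics implicit, so here is a minimal sketch of the core idea under stated assumptions: a Hugging Face GPT-2 model stands in for the pretrained LM, candidate spans from the ancient history are scored exhaustively by how much they reduce loss on a short probe of upcoming text (a stand-in for the paper's learned auxiliary function, which would not see future text at test time), and the oldest context tokens are evicted rather than the least predictive ones.
```python
# Minimal sketch (not the paper's implementation): copy a span of "ancient
# history" into the front of the context window if it helps the LM predict
# a short probe of upcoming text. The LM's parameters are never updated.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_loss(context_ids, target_ids):
    """Cross-entropy of target_ids given context_ids (lower is better)."""
    input_ids = torch.cat([context_ids, target_ids]).unsqueeze(0)
    labels = input_ids.clone()
    labels[0, : context_ids.numel()] = -100   # score only the target tokens
    with torch.no_grad():
        return model(input_ids=input_ids, labels=labels).loss.item()

def augment_context(ancient_ids, recent_ids, probe_ids, span_len=32, window=512):
    """Swap the oldest span_len tokens of the context for whichever ancient
    span best predicts the probe. Exhaustive oracle scoring stands in for the
    paper's learned span selector; oldest-first eviction stands in for its
    replacement of the least predictive spans."""
    base = recent_ids[-window:]
    best_ids, best_loss = base, lm_loss(base, probe_ids)
    kept_recent = recent_ids[-(window - span_len):]
    for start in range(0, ancient_ids.numel() - span_len + 1, span_len):
        span = ancient_ids[start : start + span_len]
        candidate = torch.cat([span, kept_recent])
        loss = lm_loss(candidate, probe_ids)
        if loss < best_loss:
            best_ids, best_loss = candidate, loss
    return best_ids  # feed this, not the raw context, to the frozen LM
```
The property the sketch preserves is the one the abstract emphasizes: the pretrained LM's parameters are never updated; only its input context is rewritten.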
Related papers
- Domain Regeneration: How well do LLMs match syntactic properties of text domains? [19.04920427362747]
We prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text -- Wikipedia and news text.
This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a semantically-controlled setting.
We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.
arXiv Detail & Related papers (2025-05-12T17:37:17Z)
- Theoretical Proof that Generated Text in the Corpus Leads to the Collapse of Auto-regressive Language Models [26.117724170912552]
This paper presents theoretical proof that once a corpus (such as the World Wide Web) begins to incorporate generated text, LM collapse is bound to occur.
We express our concerns about the current situation in which an increasing amount of generated text may be used in LM training.
arXiv Detail & Related papers (2024-12-19T14:11:15Z)
- Towards Aligning Language Models with Textual Feedback [43.55450701925131]
ALT (ALignment with Textual feedback) is an approach that aligns language models with user preferences expressed in text.
We explore the efficacy and efficiency of textual feedback across different tasks such as toxicity reduction, summarization, and dialog response generation.
arXiv Detail & Related papers (2024-07-24T03:32:05Z)
- Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
arXiv Detail & Related papers (2024-06-18T21:31:19Z)
- diff History for Neural Language Agents [33.13471417703669]
We introduce diff history, a simple and highly effective way to compress the long, redundant interaction histories used to prompt LM agents.
By applying the Unix diff command to consecutive text observations in the interaction histories used to prompt LM policies, we can abstract away redundant information (a minimal sketch appears after this list).
On NetHack, an unsolved video game that requires long-horizon reasoning for decision-making, LMs tuned with diff history match state-of-the-art performance for neural agents.
arXiv Detail & Related papers (2023-12-12T18:59:30Z)
- Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval [51.437420003471615]
We propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch.
RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
arXiv Detail & Related papers (2023-06-23T10:18:02Z)
- LeTI: Learning to Generate from Textual Interactions [60.425769582343506]
We explore LMs' potential to learn from textual interactions (LETI) that not only check their correctness with binary labels but also pinpoint and explain errors in their outputs through textual feedback.
Our focus is the code generation task, where the model produces code based on natural language instructions.
LETI iteratively fine-tunes the model, using the LM objective, on a concatenation of natural language instructions, LM-generated programs, and textual feedback (one such example is sketched after this list).
arXiv Detail & Related papers (2023-05-17T15:53:31Z)
- LAMP: Extracting Text from Gradients with Language Model Priors [9.242965489146398]
Recent work shows that sensitive user data can be reconstructed from gradient updates, breaking the key privacy promise of federated learning.
We propose LAMP, a novel attack tailored to textual data, that successfully reconstructs original text from gradients.
arXiv Detail & Related papers (2022-02-17T18:49:25Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- Enabling Language Models to Fill in the Blanks [81.59381915581892]
We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document.
We train (or fine-tune) off-the-shelf language models on sequences containing the concatenation of artificially-masked text and the text which was masked.
We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics (the training-sequence construction is sketched after this list).
arXiv Detail & Related papers (2020-05-11T18:00:03Z)
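For the diff history entry above, here is a minimal sketch of the prompt construction it describes, with Python's standard difflib used as a stand-in for the Unix diff command and NetHack-like observations invented purely for illustration.
```python
# Minimal sketch of "diff history": keep the first observation verbatim and
# represent each later observation by its diff against the previous one,
# so unchanged lines are not repeated in the LM prompt.
# difflib.unified_diff stands in for the Unix `diff` command here.
import difflib

def diff_history(observations, context=0):
    """Turn a list of multi-line text observations into a compact history."""
    if not observations:
        return ""
    chunks = [observations[0]]
    for prev, curr in zip(observations, observations[1:]):
        diff = difflib.unified_diff(
            prev.splitlines(), curr.splitlines(),
            lineterm="", n=context,   # n=0: drop unchanged context lines
        )
        chunks.append("\n".join(diff))
    return "\n\n".join(chunks)

# Example: only the changed line of the second observation is kept.
obs = [
    "HP: 14/14  Dlvl: 1\nYou see here a small dagger.",
    "HP: 12/14  Dlvl: 1\nYou see here a small dagger.",
]
print(diff_history(obs))
```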
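For LeTI, the summary above describes fine-tuning on a concatenation of instruction, LM-generated program, and textual feedback; the sketch below shows one way such a training string might be assembled, with the Python interpreter's stderr standing in for the textual feedback and the delimiter markers being assumptions rather than the paper's actual format.
```python
# Minimal sketch of assembling one LeTI-style fine-tuning example: run the
# LM-generated program, capture coarse (pass/fail) and textual feedback from
# the interpreter, and concatenate instruction + program + feedback.
# The <instruction>/<program>/<feedback> markers are illustrative only.
import subprocess
import sys
import tempfile

def run_program(program: str, timeout: int = 10) -> tuple[bool, str]:
    """Execute a generated program and return (passed, textual feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True,
                                text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False, "TimeoutExpired: program did not finish"
    passed = result.returncode == 0
    feedback = "OK" if passed else (result.stderr.strip() or f"exit code {result.returncode}")
    return passed, feedback

def build_example(instruction: str, program: str) -> str:
    """Concatenate instruction, generated program, and textual feedback."""
    passed, feedback = run_program(program)
    return (f"<instruction>\n{instruction}\n"
            f"<program>\n{program}\n"
            f"<feedback passed={passed}>\n{feedback}")

print(build_example("Return the square of x.", "def square(x): return x * x"))
```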
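Finally, for the infilling-by-language-modeling entry, a sketch of how a single training sequence could be built by concatenating the artificially masked text with the spans that were masked, so an ordinary left-to-right LM can be trained or fine-tuned on it; the [blank], [answer], and [sep] markers follow the general idea but are not guaranteed to match the paper's exact tokens.
```python
# Minimal sketch of one infilling-by-language-modeling training sequence:
# randomly mask short word spans, then append the masked-out spans after a
# separator so a causal LM learns to generate them from the masked text.
import random

def make_infilling_example(words, mask_prob=0.15, rng=random.Random(0)):
    """Randomly mask word spans and return 'masked text [sep] answers'."""
    masked, answers = [], []
    i = 0
    while i < len(words):
        if rng.random() < mask_prob:
            span_len = rng.randint(1, 3)   # mask a short span of 1-3 words
            answers.append(" ".join(words[i:i + span_len]) + " [answer]")
            masked.append("[blank]")
            i += span_len
        else:
            masked.append(words[i])
            i += 1
    return " ".join(masked) + " [sep] " + " ".join(answers)

print(make_infilling_example("the model learns to fill in missing spans of text".split()))
```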
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.