Autocorrelations Decay in Texts and Applicability Limits of Language Models
- URL: http://arxiv.org/abs/2305.06615v1
- Date: Thu, 11 May 2023 07:23:01 GMT
- Title: Autocorrelations Decay in Texts and Applicability Limits of Language Models
- Authors: Nikolay Mikhaylovskiy and Ilya Churilov
- Abstract summary: We empirically demonstrate that autocorrelations of words in texts decay according to a power law.
We show that distributional semantics yields consistent autocorrelation decay exponents for texts translated into multiple languages.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show that the laws of autocorrelation decay in texts are closely related to the applicability limits of language models. Using distributional semantics, we empirically demonstrate that autocorrelations of words in texts decay according to a power law. We show that distributional semantics yields consistent autocorrelation decay exponents for texts translated into multiple languages. Autocorrelation decay in generated texts is quantitatively, and often qualitatively, different from that in literary texts. We conclude that language models exhibiting Markov behavior, including large autoregressive language models, may have limitations when applied to long texts, whether for analysis or for generation.
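The measurement the abstract describes can be illustrated compactly. Below is a minimal Python sketch assuming pretrained word embeddings, cosine similarity as the correlation measure, and a log-log least-squares fit; these specifics are illustrative assumptions, not necessarily the authors' exact procedure.

```python
# Minimal sketch: estimate word autocorrelations in a text with
# distributional semantics (word embeddings) and fit a power-law decay
# exponent. Cosine similarity and the log-log least-squares fit are
# illustrative assumptions, not necessarily the authors' exact procedure.
import numpy as np

def autocorrelation(vectors: np.ndarray, lag: int) -> float:
    """Mean cosine similarity between word vectors `lag` positions apart."""
    a, b = vectors[:-lag], vectors[lag:]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(sims))

def power_law_exponent(vectors: np.ndarray, max_lag: int = 1000) -> float:
    """Fit C(d) ~ d^(-alpha) by linear regression in log-log coordinates."""
    lags = np.unique(np.logspace(0, np.log10(max_lag), 50).astype(int))
    lags = lags[lags < len(vectors)]
    corrs = np.array([autocorrelation(vectors, int(d)) for d in lags])
    keep = corrs > 0  # the logarithm needs positive correlations
    slope, _ = np.polyfit(np.log(lags[keep]), np.log(corrs[keep]), 1)
    return float(-slope)  # alpha: the power-law decay exponent

# Usage sketch: `embedding` is any pretrained token-to-vector table.
# tokens = open("novel.txt").read().split()
# vectors = np.stack([embedding[t] for t in tokens if t in embedding])
# print(power_law_exponent(vectors))
```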
Related papers
- Quantifying the Plausibility of Context Reliance in Neural Machine Translation [25.29330352252055]
We introduce Plausibility Evaluation of Context Reliance (PECoRe), an end-to-end interpretability framework designed to quantify context usage in language models' generations.
We use PECoRe to quantify the plausibility of context-aware machine translation models.
arXiv Detail & Related papers (2023-10-02T13:26:43Z)
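A schematic of the contrastive idea behind such context-usage quantification follows; the `next_token_dist` scorer, the clipped KL criterion, and the fixed threshold are illustrative assumptions, not PECoRe's actual pipeline.

```python
# Schematic of contrastive context-usage measurement in the spirit of the
# PECoRe entry above. The `next_token_dist` scorer, the clipped KL
# criterion, and the threshold are illustrative assumptions only.
import math
from typing import Callable, Dict, List

Dist = Dict[str, float]  # token -> probability

def context_sensitive_steps(source: List[str], context: List[str],
                            generated: List[str],
                            next_token_dist: Callable[[List[str], List[str]], Dist],
                            threshold: float) -> List[int]:
    """Flag generation steps whose next-token distribution shifts once the
    context is prepended to the source."""
    flagged = []
    for i in range(len(generated)):
        p = next_token_dist(context + source, generated[:i])  # with context
        q = next_token_dist(source, generated[:i])            # without context
        # Clipped KL divergence over tokens with mass under both distributions.
        kl = sum(p[t] * math.log(p[t] / q[t])
                 for t in p if p[t] > 0 and q.get(t, 0) > 0)
        if kl > threshold:
            flagged.append(i)
    return flagged
```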
- Natural Language Decompositions of Implicit Content Enable Better Text Representations [56.85319224208865]
We introduce a method for the analysis of text that explicitly takes implicitly communicated content into account.
We use a large language model to produce sets of propositions that are inferentially related to the observed text.
Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z)
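The decomposition step described in the entry above can be sketched generically; the prompt wording and the generic `llm` callable are assumptions for illustration, not the paper's actual prompts or models.

```python
# Sketch of the decomposition step: ask an LLM for propositions inferentially
# related to an observed text. The prompt and the `llm` callable are
# illustrative assumptions.
from typing import Callable, List

PROMPT = ("List, one per line, the propositions that the following text "
          "implies or communicates implicitly:\n\n{text}\n\nPropositions:")

def decompose(text: str, llm: Callable[[str], str]) -> List[str]:
    """Return the LLM's inferred propositions as a list of strings."""
    raw = llm(PROMPT.format(text=text))
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

# The propositions can then be embedded and used as the text's representation
# alongside (or instead of) the literal surface form.
```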
- Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text.
We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality.
We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
arXiv Detail & Related papers (2022-10-16T04:35:58Z)
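A much-simplified version of criticism in a latent space: embed human and model-generated documents, then score how far apart the two latent samples sit. The two-sample z-score aggregation below is an illustrative stand-in for the paper's statistical procedure.

```python
# Sketch of model criticism in a latent space: embed human and generated
# documents, then test whether the two sets of latents look like draws from
# the same distribution. The per-dimension z-score aggregation is an
# illustrative simplification of the paper's procedure.
import numpy as np

def latent_discrepancy(real: np.ndarray, generated: np.ndarray) -> float:
    """Two-sample z-scores per latent dimension, aggregated into one
    discrepancy score (larger = worse structural match)."""
    se = np.sqrt(real.var(axis=0) / len(real)
                 + generated.var(axis=0) / len(generated))
    z = (real.mean(axis=0) - generated.mean(axis=0)) / np.maximum(se, 1e-9)
    return float(np.sqrt(np.mean(z ** 2)))
```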
- Universality and diversity in word patterns [0.0]
We present an analysis of lexical statistical connections for eleven major languages.
We find that the diverse ways in which languages express word relations give rise to unique pattern distributions.
arXiv Detail & Related papers (2022-08-23T20:03:27Z)
- How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z)
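The core novelty check such a suite performs can be sketched in a few lines; RAVEN's actual analyses are broader, so the n-gram overlap below shows only the central idea.

```python
# Sketch of an n-gram novelty check in the spirit of the RAVEN entry above:
# what fraction of a generation's n-grams never occur in the training corpus?
from typing import List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated: List[str], training: List[str], n: int = 4) -> float:
    """Fraction of the generation's n-grams unseen in training (1.0 = fully novel)."""
    gen, train = ngrams(generated, n), ngrams(training, n)
    return len(gen - train) / max(len(gen), 1)
```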
- Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction [34.1177259741046]
We compare the two model classes side by side and find that they tend to make different types of errors even when achieving comparable performance.
We investigate how combining finite-state and sequence-to-sequence models at decoding time affects the output quantitatively and qualitatively.
arXiv Detail & Related papers (2021-06-24T00:09:24Z)
- Language Model Evaluation Beyond Perplexity [47.268323020210175]
We analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained.
We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions.
arXiv Detail & Related papers (2021-05-31T20:13:44Z)
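One concrete instance of such a tendency check: compare the Zipf rank-frequency exponent of generated text against that of its training text. The log-log least-squares fit below is an illustrative choice, not the paper's exact estimator.

```python
# Sketch of one "beyond perplexity" check: does generated text match the
# Zipf rank-frequency exponent of the human text it was trained on? The
# least-squares fit in log-log coordinates is an illustrative choice.
from collections import Counter
from typing import List
import numpy as np

def zipf_exponent(tokens: List[str]) -> float:
    """Slope magnitude of the log-log rank-frequency curve (~1 for natural text)."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return float(-slope)

# A generated corpus whose exponent deviates strongly from its training
# corpus fails this tendency even when its perplexity is low.
```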
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection [62.071938098215085]
We focus on the CommonGen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
arXiv Detail & Related papers (2020-12-19T23:23:40Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
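The hypothesis-testing framing of the last entry can be sketched with a simple test statistic; the mean per-token log-likelihood and the fixed threshold below are illustrative assumptions, not the paper's derived optimal test.

```python
# Sketch of framing generated-text detection as hypothesis testing, per the
# entry above. H0: the text is human-written; H1: it was model-generated.
# The test statistic (mean per-token log-likelihood under a scoring model)
# and the fixed threshold are illustrative assumptions only.
from typing import Callable, List

def looks_generated(tokens: List[str],
                    log_prob: Callable[[List[str], int], float],
                    threshold: float) -> bool:
    """Reject H0 ('genuine') when mean log p(token_i | prefix) under the
    scoring model exceeds `threshold`: model-generated text tends to be
    more probable under the model than human text of the same length."""
    total = sum(log_prob(tokens, i) for i in range(1, len(tokens)))
    return total / max(len(tokens) - 1, 1) > threshold

# `threshold` would be calibrated on held-out genuine text to fix the
# false-positive rate of the test.
```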