Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance
- URL: http://arxiv.org/abs/2001.02178v1
- Date: Tue, 7 Jan 2020 17:05:16 GMT
- Title: Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance
- Authors: Andrés Chacoma and Damián H. Zanette
- Abstract summary: We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English.
We analyze the progressive appearance of new words of each tag along each individual text.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the relationship between vocabulary size and text length in a corpus
of 75 literary works in English, authored by six writers, distinguishing
between the contributions of three grammatical classes (or "tags," namely,
nouns, verbs, and others), and analyze the progressive appearance of new words
of each tag along each individual text. While the power-law relation
prescribed by Heaps' law is satisfactorily fulfilled by total vocabulary sizes
and text lengths, the appearance of new words in each text is on the whole
well described by the average of random shufflings of the text, which does not
obey a power law. Deviations from this average, however, are statistically
significant and show a systematic trend across the corpus. Specifically, they
reveal that the appearance of new words along each text is predominantly
retarded with respect to the average of random shufflings. Moreover, different
tags are shown to add systematically distinct contributions to this tendency,
with verbs and others being respectively more and less retarded than the mean
trend, and nouns following instead this overall mean. These statistical
systematicities are likely to point to the existence of linguistically
relevant information stored in the different variants of Heaps' law, a feature
that is still in need of extensive assessment.
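Heaps' law asserts that vocabulary size V grows with text length n roughly as V(n) ≈ K n^β with β < 1. As a minimal, illustrative sketch of the quantities the paper compares (this is not the authors' code), the snippet below computes the empirical vocabulary-growth curve of a token sequence together with its average over random shufflings; the paper's central finding concerns the systematic deviation between the two curves.

```python
import random

def vocab_growth(tokens):
    """Number of distinct words seen after each of the first n tokens
    (the empirical Heaps function V(n))."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

def shuffled_baseline(tokens, n_shuffles=100, seed=0):
    """Average vocabulary-growth curve over random shufflings of the text.
    Deviations of the real curve from this baseline are what the paper
    reports as statistically significant."""
    rng = random.Random(seed)
    tokens = list(tokens)
    total = [0.0] * len(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(tokens)
        for i, v in enumerate(vocab_growth(tokens)):
            total[i] += v
    return [t / n_shuffles for t in total]

# Toy usage: a text whose second batch of new words appears late.
text = ("the cat sat on the mat " * 20 + "a dog ran in a park " * 20).split()
real = vocab_growth(text)
base = shuffled_baseline(text)
# Positive lag means the real text introduces new words later than chance,
# the "retarded" behaviour the paper observes across its corpus.
lag = [b - r for r, b in zip(real, base)]
```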
Related papers
- Entropy and type-token ratio in gigaword corpora [0.0]
We investigate entropy and type-token ratio, two metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish.
We find a functional relation between entropy and type-token ratio that holds across the corpora under consideration.
Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
arXiv Detail & Related papers (2024-11-15T14:40:59Z)
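As a toy illustration of the two diversity measures this paper relates (a sketch under simple assumptions, not the authors' pipeline), the snippet below computes the Shannon entropy of a word-frequency distribution and the type-token ratio of the same token sequence.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def type_token_ratio(tokens):
    """Distinct words (types) divided by total words (tokens)."""
    return len(set(tokens)) / len(tokens)

tokens = "to be or not to be that is the question".split()
print(shannon_entropy(tokens), type_token_ratio(tokens))
```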
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
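A minimal sketch of the element-wise Manhattan distance feature mentioned above, assuming the text and hypothesis have already been embedded as equal-length vectors; how those embeddings are built (the paper's "empirical text representation") is not reproduced here.

```python
import numpy as np

def manhattan_feature(text_vec: np.ndarray, hyp_vec: np.ndarray) -> np.ndarray:
    """Element-wise absolute difference between the two sentence embeddings.
    Unlike the scalar Manhattan distance (its sum), the full vector is kept
    as a feature for a downstream entailment classifier."""
    return np.abs(text_vec - hyp_vec)

# Toy usage with stand-in 4-dimensional "embeddings".
text_vec = np.array([0.2, -0.1, 0.7, 0.0])
hyp_vec = np.array([0.1, 0.3, 0.6, 0.2])
feature = manhattan_feature(text_vec, hyp_vec)  # would be fed to a classifier
print(feature, feature.sum())  # the sum is the ordinary Manhattan distance
```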
- Universality and diversity in word patterns [0.0]
We present an analysis of lexical statistical connections for eleven major languages.
We find that the diverse ways in which languages express word relations give rise to unique pattern distributions.
arXiv Detail & Related papers (2022-08-23T20:03:27Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
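A hedged sketch of the neighbor-based alternative: rather than aligning the two embedding spaces, compare the identities of each word's nearest neighbors in the two corpora. The function and neighbor lists below are illustrative stand-ins, not the authors' implementation.

```python
def usage_change_score(word, neighbors_a, neighbors_b, k=10):
    """Compare a word's k nearest neighbors in two corpora.

    neighbors_a / neighbors_b map each word to a list of its nearest
    neighbors (by embedding similarity) in corpus A and corpus B.
    Small overlap suggests the word is used differently in the two
    corpora; no alignment of the two vector spaces is needed, since
    only neighbor identities are compared."""
    a = set(neighbors_a[word][:k])
    b = set(neighbors_b[word][:k])
    return 1.0 - len(a & b) / k  # 0 = identical neighborhoods, 1 = disjoint

# Toy usage with hand-made neighbor lists.
na = {"cell": ["membrane", "nucleus", "tissue", "protein", "organism"]}
nb = {"cell": ["phone", "battery", "signal", "tower", "protein"]}
print(usage_change_score("cell", na, nb, k=5))  # high score: usage differs
```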
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
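A much-simplified stand-in for the silver-label step described above (the actual UCPhrase system is more involved than this): within a single document, contiguous word sequences that recur consistently are taken as candidate phrase spans.

```python
from collections import Counter

def silver_phrases(doc_tokens, max_len=4, min_count=3):
    """Mine candidate phrase spans from one document: contiguous word
    sequences that recur at least `min_count` times within the document.
    Consistently co-occurring sequences serve as silver labels instead
    of relying on external dictionaries."""
    counts = Counter()
    n = len(doc_tokens)
    for size in range(2, max_len + 1):
        for i in range(n - size + 1):
            counts[tuple(doc_tokens[i:i + size])] += 1
    return [" ".join(gram) for gram, c in counts.items() if c >= min_count]

doc = ("deep learning models need data . "
       "deep learning models generalize . "
       "deep learning models overfit .").split()
print(silver_phrases(doc))  # e.g. ['deep learning', 'learning models', ...]
```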
- Word frequency-rank relationship in tagged texts [0.0]
We analyze the frequency-rank relationship in sub-vocabularies corresponding to three different grammatical classes.
These results indicate that frequency-rank relationships may reflect linguistic features associated with grammatical function.
arXiv Detail & Related papers (2021-02-07T15:17:51Z)
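As a small illustrative sketch of that analysis (not the authors' code), the snippet below splits a tagged token stream into per-class sub-vocabularies and extracts each class's rank-ordered frequencies, the raw material of a Zipf-style frequency-rank curve.

```python
from collections import Counter, defaultdict

def freq_rank_by_tag(tagged_tokens):
    """Frequency-rank lists for each grammatical class.

    tagged_tokens: iterable of (word, tag) pairs, with tags such as
    'noun', 'verb', 'other'. Returns, per tag, the word frequencies
    sorted in decreasing order, i.e. frequency as a function of rank."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[tag][word] += 1
    return {tag: sorted(c.values(), reverse=True) for tag, c in counts.items()}

pairs = [("dog", "noun"), ("runs", "verb"), ("dog", "noun"),
         ("the", "other"), ("the", "other"), ("barks", "verb")]
print(freq_rank_by_tag(pairs))  # {'noun': [2], 'verb': [1, 1], 'other': [2]}
```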
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of the distribution over meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
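A minimal sketch of this operationalisation, using an invented sense distribution; the synonym counts the paper correlates against could be taken from WordNet (e.g. via NLTK's wordnet corpus), which is not bundled here.

```python
import math
from collections import Counter

def meaning_entropy(sense_counts):
    """Entropy (in bits) of a word's distribution over senses: the paper's
    operationalisation of lexical ambiguity."""
    total = sum(sense_counts.values())
    return -sum(c / total * math.log2(c / total) for c in sense_counts.values())

# Invented sense distribution for "bank", as if estimated from an
# annotated corpus.
bank_senses = Counter({"financial_institution": 70, "river_side": 25, "tilt": 5})
print(meaning_entropy(bank_senses))  # higher entropy = more ambiguous

# The synonym count to correlate against could come from WordNet,
# e.g. len(wordnet.synsets("bank")) via NLTK (requires the WordNet data).
```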
- Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts [0.15833270109954134]
A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content.
We introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts.
We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences.
arXiv Detail & Related papers (2020-08-05T17:27:11Z)
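As a hedged sketch of one measure this framework encompasses (illustrative code, not the authors' implementation), the snippet below decomposes the Jensen-Shannon divergence between two texts into per-word contributions; ranking words by these contributions is essentially what a word shift graph visualizes.

```python
import math
from collections import Counter

def js_word_contributions(tokens_a, tokens_b):
    """Per-word contributions to the Jensen-Shannon divergence between
    two corpora; the contributions are non-negative and sum to the full
    divergence, so each word's share is directly interpretable."""
    pa, pb = Counter(tokens_a), Counter(tokens_b)
    na, nb = sum(pa.values()), sum(pb.values())
    contrib = {}
    for w in set(pa) | set(pb):
        p, q = pa[w] / na, pb[w] / nb
        m = (p + q) / 2  # mixture distribution
        c = 0.0
        if p > 0:
            c += 0.5 * p * math.log2(p / m)
        if q > 0:
            c += 0.5 * q * math.log2(q / m)
        contrib[w] = c
    return contrib

a = "good good movie great plot".split()
b = "bad bad movie awful plot".split()
top = sorted(js_word_contributions(a, b).items(), key=lambda kv: -kv[1])
print(top[:3])  # words driving the difference between the two texts
```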
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.