Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance
- URL: http://arxiv.org/abs/2001.02178v1
- Date: Tue, 7 Jan 2020 17:05:16 GMT
- Title: Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance
- Authors: Andrés Chacoma and Damián H. Zanette
- Abstract summary: We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English.
We analyze the progressive appearance of new words of each tag along each individual text.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the relationship between vocabulary size and text length in a corpus
of 75 literary works in English, authored by six writers, distinguishing
between the contributions of three grammatical classes (or "tags," namely,
nouns, verbs, and others), and analyze the progressive appearance of new words
of each tag along each individual text. While the power-law relation
prescribed by Heaps' law is satisfactorily fulfilled by total vocabulary sizes
and text lengths, the appearance of new words in each text is on the whole
well described by the average of random shufflings of the text, which does not
obey a power law. Deviations from this average, however, are statistically
significant and show a systematic trend across the corpus. Specifically, they
reveal that the appearance of new words along each text is predominantly
retarded with respect to the average of random shufflings. Moreover, different
tags are shown to add systematically distinct contributions to this tendency,
with verbs and others being respectively more and less retarded than the mean
trend, and nouns following instead this overall mean. These statistical
systematicities are likely to point to the existence of linguistically
relevant information stored in the different variants of Heaps' law, a feature
that is still in need of extensive assessment.
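Heaps' law asserts that vocabulary size V grows with text length n roughly as V(n) ≈ K n^β with β < 1. As a minimal, illustrative sketch of the quantities the paper compares (this is not the authors' code), the snippet below computes the empirical vocabulary-growth curve of a token sequence together with its average over random shufflings; the paper's central finding concerns the systematic deviation between the two curves.

```python
import random

def vocab_growth(tokens):
    """Number of distinct words seen after each of the first n tokens
    (the empirical Heaps function V(n))."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

def shuffled_baseline(tokens, n_shuffles=100, seed=0):
    """Average vocabulary-growth curve over random shufflings of the text.
    Deviations of the real curve from this baseline are what the paper
    reports as statistically significant."""
    rng = random.Random(seed)
    tokens = list(tokens)
    total = [0.0] * len(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(tokens)
        for i, v in enumerate(vocab_growth(tokens)):
            total[i] += v
    return [t / n_shuffles for t in total]

# Toy usage: a text whose second batch of new words appears late.
text = ("the cat sat on the mat " * 20 + "a dog ran in a park " * 20).split()
real = vocab_growth(text)
base = shuffled_baseline(text)
# Positive lag means the real text introduces new words later than chance,
# the "retarded" behaviour the paper observes across its corpus.
lag = [b - r for r, b in zip(real, base)]
```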
Related papers
- Entropy and type-token ratio in gigaword corpora [0.0]
We investigate entropy and type-token ratio, two metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish.
We find a functional relation between entropy and type-token ratio that holds across the corpora under consideration.
Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
arXiv Detail & Related papers (2024-11-15T14:40:59Z)
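As a toy illustration of the two diversity measures this paper relates (a sketch under simple assumptions, not the authors' pipeline), the snippet below computes the Shannon entropy of a word-frequency distribution and the type-token ratio of the same token sequence.

```python
import math
from collections import Counter

def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def type_token_ratio(tokens):
    """Distinct words (types) divided by total words (tokens)."""
    return len(set(tokens)) / len(tokens)

tokens = "to be or not to be that is the question".split()
print(shannon_entropy(tokens), type_token_ratio(tokens))
```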
- Quantifying the redundancy between prosody and text [67.07817268372743]
We use large language models to estimate how much information is redundant between prosody and the words themselves.
We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features.
Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words.
arXiv Detail & Related papers (2023-11-28T21:15:24Z)
- Textual Entailment Recognition with Semantic Features from Empirical Text Representation [60.31047947815282]
A text entails a hypothesis if and only if the truth of the hypothesis follows from the text.
In this paper, we propose a novel approach to identifying the textual entailment relationship between text and hypothesis.
We employ an element-wise Manhattan distance vector-based feature that can identify the semantic entailment relationship between the text-hypothesis pair.
arXiv Detail & Related papers (2022-10-18T10:03:51Z)
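A minimal sketch of the element-wise Manhattan distance feature mentioned above, assuming the text and hypothesis have already been embedded as equal-length vectors; how those embeddings are built (the paper's "empirical text representation") is not reproduced here.

```python
import numpy as np

def manhattan_feature(text_vec: np.ndarray, hyp_vec: np.ndarray) -> np.ndarray:
    """Element-wise absolute difference between the two sentence embeddings.
    Unlike the scalar Manhattan distance (its sum), the full vector is kept
    as a feature for a downstream entailment classifier."""
    return np.abs(text_vec - hyp_vec)

# Toy usage with stand-in 4-dimensional "embeddings".
text_vec = np.array([0.2, -0.1, 0.7, 0.0])
hyp_vec = np.array([0.1, 0.3, 0.6, 0.2])
feature = manhattan_feature(text_vec, hyp_vec)  # would be fed to a classifier
print(feature, feature.sum())  # the sum is the ordinary Manhattan distance
```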
- Universality and diversity in word patterns [0.0]
We present an analysis of lexical statistical connections for eleven major languages.
We find that the diverse ways in which languages express word relations give rise to unique pattern distributions.
arXiv Detail & Related papers (2022-08-23T20:03:27Z)
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
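A hedged sketch of the neighbor-based alternative: rather than aligning the two embedding spaces, compare the identities of each word's nearest neighbors in the two corpora. The function and neighbor lists below are illustrative stand-ins, not the authors' implementation.

```python
def usage_change_score(word, neighbors_a, neighbors_b, k=10):
    """Compare a word's k nearest neighbors in two corpora.

    neighbors_a / neighbors_b map each word to a list of its nearest
    neighbors (by embedding similarity) in corpus A and corpus B.
    Small overlap suggests the word is used differently in the two
    corpora; no alignment of the two vector spaces is needed, since
    only neighbor identities are compared."""
    a = set(neighbors_a[word][:k])
    b = set(neighbors_b[word][:k])
    return 1.0 - len(a & b) / k  # 0 = identical neighborhoods, 1 = disjoint

# Toy usage with hand-made neighbor lists.
na = {"cell": ["membrane", "nucleus", "tissue", "protein", "organism"]}
nb = {"cell": ["phone", "battery", "signal", "tower", "protein"]}
print(usage_change_score("cell", na, nb, k=5))  # high score: usage differs
```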
- UCPhrase: Unsupervised Context-aware Quality Phrase Tagging [63.86606855524567]
UCPhrase is a novel unsupervised context-aware quality phrase tagger.
We induce high-quality phrase spans as silver labels from consistently co-occurring word sequences.
We show that our design is superior to state-of-the-art pre-trained, unsupervised, and distantly supervised methods.
arXiv Detail & Related papers (2021-05-28T19:44:24Z)
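A much-simplified stand-in for the silver-label step described above (the actual UCPhrase system is more involved than this): within a single document, contiguous word sequences that recur consistently are taken as candidate phrase spans.

```python
from collections import Counter

def silver_phrases(doc_tokens, max_len=4, min_count=3):
    """Mine candidate phrase spans from one document: contiguous word
    sequences that recur at least `min_count` times within the document.
    Consistently co-occurring sequences serve as silver labels instead
    of relying on external dictionaries."""
    counts = Counter()
    n = len(doc_tokens)
    for size in range(2, max_len + 1):
        for i in range(n - size + 1):
            counts[tuple(doc_tokens[i:i + size])] += 1
    return [" ".join(gram) for gram, c in counts.items() if c >= min_count]

doc = ("deep learning models need data . "
       "deep learning models generalize . "
       "deep learning models overfit .").split()
print(silver_phrases(doc))  # e.g. ['deep learning', 'learning models', ...]
```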
- Word frequency-rank relationship in tagged texts [0.0]
We analyze the frequency-rank relationship in sub-vocabularies corresponding to three different grammatical classes.
These results indicate that frequency-rank relationships may reflect linguistic features associated with grammatical function.
arXiv Detail & Related papers (2021-02-07T15:17:51Z)
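As a small illustrative sketch of that analysis (not the authors' code), the snippet below splits a tagged token stream into per-class sub-vocabularies and extracts each class's rank-ordered frequencies, the raw material of a Zipf-style frequency-rank curve.

```python
from collections import Counter, defaultdict

def freq_rank_by_tag(tagged_tokens):
    """Frequency-rank lists for each grammatical class.

    tagged_tokens: iterable of (word, tag) pairs, with tags such as
    'noun', 'verb', 'other'. Returns, per tag, the word frequencies
    sorted in decreasing order, i.e. frequency as a function of rank."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[tag][word] += 1
    return {tag: sorted(c.values(), reverse=True) for tag, c in counts.items()}

pairs = [("dog", "noun"), ("runs", "verb"), ("dog", "noun"),
         ("the", "other"), ("the", "other"), ("barks", "verb")]
print(freq_rank_by_tag(pairs))  # {'noun': [2], 'verb': [1, 1], 'other': [2]}
```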
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of the distribution over meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
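A minimal sketch of this operationalisation, using an invented sense distribution; the synonym counts the paper correlates against could be taken from WordNet (e.g. via NLTK's wordnet corpus), which is not bundled here.

```python
import math
from collections import Counter

def meaning_entropy(sense_counts):
    """Entropy (in bits) of a word's distribution over senses: the paper's
    operationalisation of lexical ambiguity."""
    total = sum(sense_counts.values())
    return -sum(c / total * math.log2(c / total) for c in sense_counts.values())

# Invented sense distribution for "bank", as if estimated from an
# annotated corpus.
bank_senses = Counter({"financial_institution": 70, "river_side": 25, "tilt": 5})
print(meaning_entropy(bank_senses))  # higher entropy = more ambiguous

# The synonym count to correlate against could come from WordNet,
# e.g. len(wordnet.synsets("bank")) via NLTK (requires the WordNet data).
```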
- Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts [0.15833270109954134]
A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content.
We introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts.
We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences.
arXiv Detail & Related papers (2020-08-05T17:27:11Z)
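As a hedged sketch of one measure this framework encompasses (illustrative code, not the authors' implementation), the snippet below decomposes the Jensen-Shannon divergence between two texts into per-word contributions; ranking words by these contributions is essentially what a word shift graph visualizes.

```python
import math
from collections import Counter

def js_word_contributions(tokens_a, tokens_b):
    """Per-word contributions to the Jensen-Shannon divergence between
    two corpora; the contributions are non-negative and sum to the full
    divergence, so each word's share is directly interpretable."""
    pa, pb = Counter(tokens_a), Counter(tokens_b)
    na, nb = sum(pa.values()), sum(pb.values())
    contrib = {}
    for w in set(pa) | set(pb):
        p, q = pa[w] / na, pb[w] / nb
        m = (p + q) / 2  # mixture distribution
        c = 0.0
        if p > 0:
            c += 0.5 * p * math.log2(p / m)
        if q > 0:
            c += 0.5 * q * math.log2(q / m)
        contrib[w] = c
    return contrib

a = "good good movie great plot".split()
b = "bad bad movie awful plot".split()
top = sorted(js_word_contributions(a, b).items(), key=lambda kv: -kv[1])
print(top[:3])  # words driving the difference between the two texts
```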
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.