A Statistical Model of Word Rank Evolution
- URL: http://arxiv.org/abs/2107.09948v1
- Date: Wed, 21 Jul 2021 08:57:32 GMT
- Title: A Statistical Model of Word Rank Evolution
- Authors: Alex John Quijano, Rick Dale, and Suzanne Sindi
- Abstract summary: This work explores the word rank dynamics of eight languages by investigating the Google Books corpus unigram frequency data set.
We observed the rank changes of the unigrams from 1900 to 2008 and compared them to a Wright-Fisher-inspired model that we developed for our analysis.
- Score: 1.1011268090482575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The availability of large linguistic data sets enables data-driven approaches
to study linguistic change. This work explores the word rank dynamics of eight
languages by investigating the Google Books corpus unigram frequency data set.
We observed the rank changes of the unigrams from 1900 to 2008 and compared them
to a Wright-Fisher-inspired model that we developed for our analysis. The model
simulates a neutral evolutionary process with the restriction that no word
disappears. This work explains the mathematical framework of the model, written
as a Markov chain with multinomial transition probabilities, to show how word
frequencies change in time. From our observations in the data and our model,
word rank stability shows two characteristic behaviors: (1) the ranks increase
or decrease monotonically, or (2) the average rank stays the same. Based on our
model, high-ranked words tend to be more stable while low-ranked words tend to
be more volatile. Words change in rank in two ways: (a) by an accumulation of
small rank changes over time and (b) by sudden shocks that sharply increase or
decrease rank. Most stopwords and Swadesh words are observed to be stable in
rank across the eight languages. These signatures suggest that unigram
frequencies in all languages have changed in a manner inconsistent with a
purely neutral evolutionary process.
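For readers who want to experiment with the kind of model the abstract describes, the sketch below simulates a neutral Wright-Fisher-style process: word counts are resampled each generation from a multinomial distribution over the current relative frequencies, with the restriction that no word disappears. This is a minimal illustration under assumed conventions (toy counts, and a word that drops to zero reclaiming one count from the top-ranked word); it is not the authors' implementation.

```python
import numpy as np

def neutral_step(counts, rng):
    """One Wright-Fisher-style generation: resample word counts from a
    multinomial over the current relative frequencies, then enforce the
    restriction that no word disappears (hypothetical convention: a word
    that drops to zero takes one count from the current top-ranked word)."""
    total = counts.sum()
    new_counts = rng.multinomial(total, counts / total)
    for idx in np.flatnonzero(new_counts == 0):
        new_counts[np.argmax(new_counts)] -= 1
        new_counts[idx] += 1
    return new_counts

rng = np.random.default_rng(seed=1900)
counts = np.array([5000, 2000, 800, 150, 50])  # toy unigram counts for 5 "words"
for year in range(1900, 1910):
    counts = neutral_step(counts, rng)
    ranks = np.argsort(np.argsort(-counts)) + 1  # rank 1 = most frequent word
    print(year, counts, ranks)
```

Under this neutral dynamic, high-frequency words change rank rarely while low-frequency words swap ranks often, which is the baseline the paper compares the Google Books data against.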
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- Not wacky vs. definitely wacky: A study of scalar adverbs in pretrained language models [0.0]
Modern pretrained language models, such as BERT, RoBERTa, and GPT-3, hold the promise of performing better on logical tasks than classic static word embeddings.
We investigate the extent to which BERT, RoBERTa, GPT-2, and GPT-3 exhibit general, human-like knowledge of these common words.
We find that despite capturing some aspects of logical meaning, the models fall far short of human performance.
arXiv Detail & Related papers (2023-05-25T18:56:26Z)
- Language statistics at different spatial, temporal, and grammatical scales [48.7576911714538]
We use data from Twitter to explore rank diversity at different scales.
The greatest changes come from variations in the grammatical scale.
As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales.
arXiv Detail & Related papers (2022-07-02T01:38:48Z)
- Word Order Does Matter (And Shuffled Language Models Know It) [9.990431777927421]
Recent studies have shown that language models pretrained and/or fine-tuned on randomly permuted sentences exhibit competitive performance on GLUE.
We investigate what position embeddings learned from shuffled text encode, showing that these models retain information pertaining to the original, naturalistic word order.
arXiv Detail & Related papers (2022-03-21T14:10:15Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Word2rate: training and evaluating multiple word embeddings as statistical transitions [4.350783459690612]
We introduce a novel left-right context split objective that improves performance for tasks sensitive to word order.
Our Word2rate model is grounded in a statistical foundation using rate matrices while being competitive in a variety of language tasks.
arXiv Detail & Related papers (2021-04-16T15:31:29Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multilingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)