Probabilistic Method of Measuring Linguistic Productivity
- URL: http://arxiv.org/abs/2308.12643v1
- Date: Thu, 24 Aug 2023 08:36:28 GMT
- Title: Probabilistic Method of Measuring Linguistic Productivity
- Authors: Sergei Monakhov
- Abstract summary: I propose a new way of measuring linguistic productivity that objectively assesses the ability of an affix to be used to coin new complex words.
Token frequency does not dominate the productivity measure but naturally influences the sampling of bases.
A corpus-based approach and randomised design assure that true neologisms and words coined long ago have equal chances to be selected.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper I propose a new way of measuring linguistic productivity that
objectively assesses the ability of an affix to be used to coin new complex
words and, unlike other popular measures, is not directly dependent upon token
frequency. Specifically, I suggest that linguistic productivity may be viewed
as the probability that an affix combines with a random base. The advantages of
this approach include the following. First, token frequency does not dominate
the productivity measure but naturally influences the sampling of bases.
Second, we are not just counting attested word types with an affix but rather
simulating the construction of these types and then checking whether they are
attested in the corpus. Third, a corpus-based approach and randomised design
assure that true neologisms and words coined long ago have equal chances to be
selected. The proposed algorithm is evaluated both on English and Russian data.
The obtained results provide some valuable insights into the relation of
linguistic productivity to the number of types and tokens. It appears that
burgeoning linguistic productivity manifests itself in an increasing number of
types. However, this process unfolds in two stages: first comes the increase in
high-frequency items, and only then follows the increase in low-frequency
items.
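
The abstract describes the measure operationally: draw a base at random, with token frequency shaping the draw, attach the affix, and check whether the resulting complex word is attested in the corpus; the share of attested outcomes estimates the combination probability. The following Python sketch is a minimal illustration of that idea, not the paper's actual algorithm: the function name, the frequency-weighted sampling scheme, and the toy data are all hypothetical, and plain string concatenation ignores spelling changes such as happy + -ness -> happiness.

```python
import random

def estimate_productivity(affix, base_tokens, attested_types, n_trials=10_000, seed=0):
    """Estimate affix productivity as the probability that the affix
    combines with a randomly sampled base to yield an attested word.

    base_tokens    -- candidate bases, one entry per corpus token, so a
                      uniform draw is implicitly frequency-weighted
    attested_types -- set of word types observed in the corpus
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        base = rng.choice(base_tokens)   # token frequency shapes the sample
        candidate = base + affix         # simulate coining a complex word
        if candidate in attested_types:  # is the simulated type attested?
            hits += 1
    return hits / n_trials

# Toy example: "-ness" combines with three of the sampled base types.
bases = ["kind", "kind", "dark", "aware", "table"]
types = {"kindness", "darkness", "awareness"}
print(estimate_productivity("ness", bases, types))  # roughly 0.8
```

Because bases are drawn per token rather than per type, frequent bases are sampled more often, which is how token frequency influences the sampling of bases without directly entering the productivity estimate.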
Related papers
- How to Compute the Probability of a Word [45.23856093235994]
This paper derives the correct methods for computing word probabilities.
We show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
arXiv Detail & Related papers (2024-06-20T17:59:42Z)
- On the Usefulness of Embeddings, Clusters and Strings for Text Generator Evaluation [86.19634542434711]
Mauve measures an information-theoretic divergence between two probability distributions over strings.
We show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance.
We conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes to string distributions may simply be better for evaluating state-of-the-art language generators.
arXiv Detail & Related papers (2022-05-31T17:58:49Z)
- On the probability-quality paradox in language generation [76.69397802617064]
We analyze language generation through an information-theoretic lens.
We posit that human-like language should contain an amount of information close to the entropy of the distribution over natural strings.
arXiv Detail & Related papers (2022-03-31T17:43:53Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade.
This paper first points out the problems with using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
- Deep Lexical Hypothesis: Identifying personality structure in natural language [0.30458514384586394]
We introduce a method to extract adjective similarities from language models.
The correlational structure produced through this method is highly similar to that of self- and other-ratings of 435 terms reported by Saucier and Goldberg.
Notably, Neuroticism and Openness are only weakly and inconsistently recovered.
arXiv Detail & Related papers (2022-03-04T02:06:10Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
- You should evaluate your language model on marginal likelihood over tokenisations [5.824498637088864]
We argue that language models should be evaluated on their marginal likelihood over tokenisations.
We evaluate pretrained English and German language models on both the one-best-tokenisation and marginal perplexities.
arXiv Detail & Related papers (2021-09-06T15:37:02Z)
- Tweet Sentiment Quantification: An Experimental Re-Evaluation [88.60021378715636]
Sentiment quantification is the task of training, by means of supervised learning, estimators of the relative frequency (also called "prevalence") of sentiment-related classes.
We re-evaluate those quantification methods following a now consolidated and much more robust experimental protocol.
Results are dramatically different from those obtained by Gao and Sebastiani, and they provide a much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
arXiv Detail & Related papers (2020-11-04T21:41:34Z)
- Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language may fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)