Revisiting the Optimality of Word Lengths
- URL: http://arxiv.org/abs/2312.03897v1
- Date: Wed, 6 Dec 2023 20:41:47 GMT
- Title: Revisiting the Optimality of Word Lengths
- Authors: Tiago Pimentel, Clara Meister, Ethan Gotlieb Wilcox, Kyle Mahowald,
Ryan Cotterell
- Abstract summary: Communicative cost can be operationalized in different ways.
Zipf (1935) posited that wordforms are optimized to minimize utterances' communicative costs.
- Score: 92.70590105707639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zipf (1935) posited that wordforms are optimized to minimize utterances'
communicative costs. Under the assumption that cost is given by an utterance's
length, he supported this claim by showing that words' lengths are inversely
correlated with their frequencies. Communicative cost, however, can be
operationalized in different ways. Piantadosi et al. (2011) claim that cost
should be measured as the distance between an utterance's information rate and
channel capacity, which we dub the channel capacity hypothesis (CCH) here.
Following this logic, they then proposed that a word's length should be
proportional to the expected value of its surprisal (negative log-probability
in context). In this work, we show that Piantadosi et al.'s derivation does not
minimize CCH's cost, but rather a lower bound, which we term CCH-lower. We
propose a novel derivation, suggesting an improved way to minimize CCH's cost.
Under this method, we find that a language's word lengths should instead be
proportional to the surprisal's expectation plus its variance-to-mean ratio.
Experimentally, we compare these three communicative cost functions: Zipf's,
CCH-lower, and CCH. Across 13 languages and several experimental settings, we
find that length is better predicted by frequency than by either of the other
hypotheses. In fact, when surprisal's expectation, or expectation plus
variance-to-mean ratio, is estimated using better language models, it leads to
worse word length predictions. We take these results as evidence that Zipf's
longstanding hypothesis holds.
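To make the three hypotheses concrete: Zipf's hypothesis predicts word length from frequency (here operationalized as unigram surprisal, $-\log p(w)$), CCH-lower predicts it from the expected in-context surprisal $\mathrm{E}[s(w)]$, and the CCH derivation proposed in this paper predicts it from $\mathrm{E}[s(w)] + \mathrm{Var}[s(w)]/\mathrm{E}[s(w)]$. The sketch below computes these three predictors from per-word surprisal samples; the function name, the input format, and the use of unigram surprisal as the Zipfian predictor are illustrative assumptions, not the paper's actual implementation.

```python
import math

def word_length_predictors(surprisals_by_word, counts_by_word, total_tokens):
    """Compute the three word-length predictors discussed in the abstract.

    surprisals_by_word: dict mapping a word to a list of its in-context
        surprisals (negative log-probabilities under some language model).
    counts_by_word: dict mapping a word to its corpus frequency.
    total_tokens: total number of tokens in the corpus.

    Per word, returns:
      - zipf:      unigram surprisal, -log p(w)       (Zipf's hypothesis)
      - cch_lower: E[s(w)]                            (Piantadosi et al.'s predictor)
      - cch:       E[s(w)] + Var[s(w)] / E[s(w)]      (this paper's CCH predictor)
    """
    predictors = {}
    for word, surprisals in surprisals_by_word.items():
        n = len(surprisals)
        mean = sum(surprisals) / n
        var = sum((s - mean) ** 2 for s in surprisals) / n
        zipf = -math.log(counts_by_word[word] / total_tokens)
        predictors[word] = {
            "zipf": zipf,
            "cch_lower": mean,
            # Guard against a degenerate zero-mean surprisal sample.
            "cch": mean + var / mean if mean > 0 else mean,
        }
    return predictors
```

Under each hypothesis, a word's length is predicted to be proportional to the corresponding quantity; the paper's experiments then test which predictor best matches attested word lengths across 13 languages.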
Related papers
- Relative-Translation Invariant Wasserstein Distance [82.6068808353647]
We introduce a new family of distances, relative-translation invariant Wasserstein distances ($RW_p$).
We show that $RW_p$ distances are also real distance metrics defined on the quotient set $\mathcal{P}_p(\mathbb{R}^n)/\sim$ and are invariant to distribution translations.
arXiv Detail & Related papers (2024-09-04T03:41:44Z)
- Uncertainty in Language Models: Assessment through Rank-Calibration [65.10149293133846]
Language Models (LMs) have shown promising performance in natural language generation.
It is crucial to correctly quantify their uncertainty in responding to given inputs.
We develop a novel and practical framework, termed $Rank$-$Calibration$, to assess uncertainty and confidence measures for LMs.
arXiv Detail & Related papers (2024-04-04T02:31:05Z)
- Semantic Text Transmission via Prediction with Small Language Models: Cost-Similarity Trade-off [7.666188363531336]
We exploit language's inherent correlations and predictability to constrain transmission costs by allowing the destination to predict or complete words.
We obtain cost-similarity pairs $(\bar{c}, \bar{s})$ for neural and first-order Markov chain-based small language models.
We demonstrate that, when communication occurs over a noiseless channel, the threshold policy achieves a higher $\bar{s}$ for a given $\bar{c}$ than the periodic policy.
arXiv Detail & Related papers (2024-03-01T05:20:16Z)
- TIC-TAC: A Framework for Improved Covariance Estimation in Deep Heteroscedastic Regression [109.69084997173196]
Deep heteroscedastic regression involves jointly optimizing the mean and covariance of the predicted distribution using the negative log-likelihood.
Recent works show that this may result in sub-optimal convergence due to the challenges associated with covariance estimation.
We study, among other questions, whether the predicted covariance truly captures the randomness of the predicted mean.
Our results show that not only does TIC accurately learn the covariance, it additionally facilitates an improved convergence of the negative log-likelihood.
arXiv Detail & Related papers (2023-10-29T09:54:03Z)
- Syntactic Surprisal From Neural Models Predicts, But Underestimates, Human Processing Difficulty From Syntactic Ambiguities [19.659811811023374]
We propose a method for estimating syntactic predictability from a language model.
We find that treating syntactic predictability independently from lexical predictability indeed results in larger estimates of garden path effects.
Our results support the hypothesis that predictability is not the only factor responsible for the processing cost associated with garden path sentences.
arXiv Detail & Related papers (2022-10-21T18:30:56Z)
- Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning [59.02006924867438]
Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions.
Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy OPE/L's sensitivity to distribution shift, but the proposal relies on inverse-propensity weighting.
We propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets.
arXiv Detail & Related papers (2022-02-19T20:00:44Z)
- Dependency distance minimization predicts compression [1.2944868613449219]
Dependency distance minimization (DDm) is a well-established principle of word order.
The prediction that DDm implies compression is a second-order prediction, because it links one principle with another principle, rather than a principle with one of its manifestations, as in a first-order prediction.
We use a recently introduced score that has many mathematical and statistical advantages with respect to the widely used sum of dependency distances.
arXiv Detail & Related papers (2021-09-18T10:53:39Z)
- Linear-time calculation of the expected sum of edge lengths in random projective linearizations of trees [1.2944868613449219]
The sum of distances between syntactically related words has been in the limelight for the past decades.
Various random baselines have been defined to carry out related quantitative studies on languages.
Here we focus on a popular baseline: random projective permutations of the words of the sentence.
arXiv Detail & Related papers (2021-07-07T15:11:53Z)
- Robust Linear Regression: Optimal Rates in Polynomial Time [11.646151402884215]
We obtain robust and computationally efficient estimators for learning several linear models.
We identify an analytic condition that serves as a relaxation of independence of random variables.
Our central technical contribution is to algorithmically exploit independence of random variables in the "sum-of-squares" framework.
arXiv Detail & Related papers (2020-06-29T17:22:16Z)
- An Analysis of the Adaptation Speed of Causal Models [80.77896315374747]
Recently, Bengio et al. conjectured that among all candidate models, the causal model $G$ is the fastest to adapt from one dataset to another.
We investigate the adaptation speed of cause-effect SCMs using convergence rates from optimization.
Surprisingly, we find situations where the anticausal model is advantaged, falsifying the initial hypothesis.
arXiv Detail & Related papers (2020-05-18T23:48:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.