Estimating the Entropy of Linguistic Distributions
- URL: http://arxiv.org/abs/2204.01469v2
- Date: Tue, 5 Apr 2022 03:46:10 GMT
- Title: Estimating the Entropy of Linguistic Distributions
- Authors: Aryaman Arora, Clara Meister, Ryan Cotterell
- Abstract summary: We study the empirical effectiveness of different entropy estimators for linguistic distributions.
We find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators.
- Score: 75.20045001387685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Shannon entropy is often a quantity of interest to linguists studying the
communicative capacity of human language. However, entropy must typically be
estimated from observed data because researchers do not have access to the
underlying probability distribution that gives rise to these data. While
entropy estimation is a well-studied problem in other fields, there is not yet
a comprehensive exploration of the efficacy of entropy estimators for use with
linguistic data. In this work, we fill this void, studying the empirical
effectiveness of different entropy estimators for linguistic distributions. In
a replication of two recent information-theoretic linguistic studies, we find
evidence that the reported effect size is over-estimated due to over-reliance
on poor entropy estimators. Finally, we end our paper with concrete
recommendations for entropy estimation depending on distribution type and data
availability.
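To make the abstract's point concrete, the sketch below contrasts the naive plug-in (maximum-likelihood) entropy estimator with the Miller-Madow bias correction on a toy Zipfian vocabulary. It is a minimal illustration of the kind of estimator comparison the paper performs, not the authors' code; the Zipfian setup, sample size, and variable names are assumptions chosen only for demonstration.

```python
import numpy as np

def plugin_entropy(counts):
    """Naive maximum-likelihood (plug-in) estimate of Shannon entropy in nats."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def miller_madow_entropy(counts):
    """Plug-in estimate plus the Miller-Madow first-order bias correction."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    k_observed = np.count_nonzero(counts)  # number of observed categories
    return plugin_entropy(counts) + (k_observed - 1) / (2.0 * n)

# Toy demonstration: a Zipfian distribution over a 1,000-word vocabulary
# (illustrative assumption, not the paper's experimental setup).
rng = np.random.default_rng(0)
ranks = np.arange(1, 1001)
true_p = (1.0 / ranks) / np.sum(1.0 / ranks)
true_H = -np.sum(true_p * np.log(true_p))

sample = rng.choice(len(true_p), size=500, p=true_p)   # small sample
counts = np.bincount(sample, minlength=len(true_p))

print(f"true entropy:     {true_H:.3f} nats")
print(f"plug-in estimate: {plugin_entropy(counts):.3f} nats")   # biased low
print(f"Miller-Madow:     {miller_madow_entropy(counts):.3f} nats")
```

With small samples the plug-in estimate is biased downward; bias-corrected or Bayesian estimators narrow this gap, which is why the choice of estimator can change a study's reported effect size.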
Related papers
- InfoMatch: Entropy Neural Estimation for Semi-Supervised Image Classification [2.878018421751116]
We employ information entropy neural estimation to utilize the potential of unlabeled samples.
Inspired by contrastive learning, the entropy is estimated by maximizing a lower bound on mutual information.
We show our method's superior performance in extensive experiments.
arXiv Detail & Related papers (2024-04-17T02:29:44Z)
- Estimating Unknown Population Sizes Using the Hypergeometric Distribution [1.03590082373586]
We tackle the challenge of estimating discrete distributions when both the total population size and the sizes of its constituent categories are unknown.
We develop our approach to account for a data generating process where the ground-truth is a mixture of distributions conditional on a continuous latent variable.
Empirical data simulation demonstrates that our method outperforms other likelihood functions used to model count data.
arXiv Detail & Related papers (2024-02-22T01:53:56Z)
- Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies.
We show that the likelihood of the available data has no local maxima.
We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z)
- Revisiting Entropy Rate Constancy in Text [43.928576088761844]
The uniform information density hypothesis states that humans tend to distribute information roughly evenly across an utterance or discourse.
We re-evaluate the claims of Genzel & Charniak (2002) with neural language models, failing to find clear evidence in support of entropy rate constancy.
arXiv Detail & Related papers (2023-05-20T03:48:31Z)
- Statistical Properties of the Entropy from Ordinal Patterns [55.551675080361335]
Knowing the joint distribution of the pair Entropy-Statistical Complexity for a large class of time series models would allow statistical tests that are unavailable to date.
We characterize the distribution of the empirical Shannon's Entropy for any model under which the true normalized Entropy is neither zero nor one.
We present a bilateral test that verifies if there is enough evidence to reject the hypothesis that two signals produce ordinal patterns with the same Shannon's Entropy.
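For readers unfamiliar with ordinal patterns, the sketch below computes the normalized Shannon entropy of a series' ordinal-pattern (permutation) distribution in the Bandt-Pompe sense. It illustrates the quantity whose sampling distribution the paper characterizes, not the paper's statistical test; the example signals and pattern order are arbitrary choices.

```python
import numpy as np
from collections import Counter
from math import factorial, log

def ordinal_pattern_entropy(x, order=3):
    """Normalized empirical Shannon entropy of ordinal patterns of a series.

    Each window of `order` consecutive values is mapped to the permutation
    that sorts it (its ordinal pattern); the entropy of the resulting pattern
    distribution is then normalized by log(order!) to lie in [0, 1].
    """
    x = np.asarray(x)
    patterns = Counter(
        tuple(np.argsort(x[i:i + order])) for i in range(len(x) - order + 1)
    )
    total = sum(patterns.values())
    probs = [c / total for c in patterns.values()]
    H = -sum(p * log(p) for p in probs)
    return H / log(factorial(order))

rng = np.random.default_rng(1)
white_noise = rng.normal(size=5000)      # all patterns roughly equally likely -> H near 1
t = np.linspace(0, 20 * np.pi, 5000)
smooth = np.sin(t)                       # mostly monotone windows -> much lower H
print(ordinal_pattern_entropy(white_noise))
print(ordinal_pattern_entropy(smooth))
```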
arXiv Detail & Related papers (2022-09-15T23:55:58Z)
- On the probability-quality paradox in language generation [76.69397802617064]
We analyze language generation through an information-theoretic lens.
We posit that human-like language should contain an amount of information close to the entropy of the distribution over natural strings.
arXiv Detail & Related papers (2022-03-31T17:43:53Z)
- Automatically Identifying Semantic Bias in Crowdsourced Natural Language Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets.
Interventions and additional rounds of labeling can then be performed to ameliorate the semantic bias of the hypothesis distribution of a dataset.
arXiv Detail & Related papers (2021-12-16T22:49:01Z)
- Entropic Causal Inference: Identifiability and Finite Sample Results [14.495984877053948]
Entropic causal inference is a framework for inferring the causal direction between two categorical variables from observational data.
We consider the minimum entropy coupling-based algorithmic approach presented by Kocaoglu et al.
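As a rough illustration of the coupling step, the sketch below implements a common greedy approximation to the minimum entropy coupling of two categorical marginals. It is a simplified stand-in for the algorithmic approach of Kocaoglu et al., not their exact procedure, and the example marginals are made up.

```python
import numpy as np

def greedy_min_entropy_coupling(p, q, tol=1e-12):
    """Greedy approximation to a minimum-entropy coupling of marginals p and q.

    Repeatedly matches the largest remaining mass in p with the largest
    remaining mass in q, placing min(p_i, q_j) in the joint table. The result
    is a valid coupling whose entropy is close to (though not always exactly)
    the minimum achievable.
    """
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    joint = np.zeros((len(p), len(q)))
    while p.max() > tol and q.max() > tol:
        i, j = int(np.argmax(p)), int(np.argmax(q))
        m = min(p[i], q[j])
        joint[i, j] += m
        p[i] -= m
        q[j] -= m
    return joint

def entropy_bits(dist):
    dist = np.asarray(dist, dtype=float).ravel()
    dist = dist[dist > 0]
    return -np.sum(dist * np.log2(dist))

p = [0.5, 0.3, 0.2]
q = [0.6, 0.4]
joint = greedy_min_entropy_coupling(p, q)
print(joint)                # rows sum to p, columns sum to q
print(entropy_bits(joint))  # well below entropy_bits(p) + entropy_bits(q)
```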
arXiv Detail & Related papers (2021-01-10T08:37:54Z)
- Neural Joint Entropy Estimation [12.77733789371855]
Estimating the entropy of a discrete random variable is a fundamental problem in information theory and related fields.
In this work, we introduce a practical solution to this problem, which extends the work of McAllester and Stratos (2020).
The proposed scheme uses the generalization abilities of cross-entropy estimation in deep neural networks (DNNs) to introduce improved entropy estimation accuracy.
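The underlying idea, that the average negative log-probability a predictive model assigns to held-out samples upper-bounds the true entropy, can be sketched without a neural network. The example below substitutes a simple add-alpha smoothed model for the DNN, so it illustrates the cross-entropy bound rather than the proposed scheme itself; the sample sizes and Dirichlet source are illustrative assumptions.

```python
import numpy as np

def cross_entropy_estimate(train, test, vocab_size, alpha=1.0):
    """Estimate H(X) via the cross-entropy of a model fit on separate data.

    E[-log q(x)] = H(X) + KL(p || q) >= H(X), so the average negative
    log-probability under any model q upper-bounds the true entropy; a better
    q (here add-alpha smoothed counts, in neural estimators a trained DNN)
    gives a tighter estimate.
    """
    counts = np.bincount(train, minlength=vocab_size).astype(float)
    q = (counts + alpha) / (counts.sum() + alpha * vocab_size)
    return float(np.mean(-np.log(q[test])))

rng = np.random.default_rng(2)
vocab_size = 200
probs = rng.dirichlet(np.ones(vocab_size) * 0.2)
true_H = -np.sum(probs[probs > 0] * np.log(probs[probs > 0]))

data = rng.choice(vocab_size, size=20000, p=probs)
train, test = data[:15000], data[15000:]

print(f"true entropy:           {true_H:.3f} nats")
print(f"cross-entropy estimate: {cross_entropy_estimate(train, test, vocab_size):.3f} nats")
```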
arXiv Detail & Related papers (2020-12-21T09:23:39Z)
- Generalized Entropy Regularization or: There's Nothing Special about Label Smoothing [83.78668073898001]
We introduce a family of entropy regularizers, which includes label smoothing as a special case.
We find that variance in model performance can be explained largely by the resulting entropy of the model.
We advise the use of other entropy regularization methods in its place.
arXiv Detail & Related papers (2020-05-02T12:46:28Z)
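As a rough sketch of such a family, the example below writes label smoothing as a KL(uniform || prediction) penalty added to cross-entropy, and the confidence penalty as the reverse KL (equivalently, a negative-entropy penalty plus a constant). The beta value and logits are illustrative, and the formulation is a simplification rather than the paper's exact generalized regularizers.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def regularized_loss(logits, gold, beta=0.1, mode="label_smoothing"):
    """Cross-entropy plus an entropy-style regularizer on the prediction p.

    - "label_smoothing":    adds beta * KL(u || p) (u = uniform), which mirrors
      standard label smoothing up to constants and scaling.
    - "confidence_penalty": adds beta * KL(p || u) = -beta * H(p) + const,
      directly penalizing low-entropy (over-confident) predictions.
    Both can be seen as members of a broader family of entropy regularizers.
    """
    p = softmax(logits)
    u = np.full(len(p), 1.0 / len(p))
    nll = -np.log(p[gold])
    if mode == "label_smoothing":
        reg = np.sum(u * np.log(u / p))   # KL(u || p)
    elif mode == "confidence_penalty":
        reg = np.sum(p * np.log(p / u))   # KL(p || u)
    else:
        raise ValueError(mode)
    return nll + beta * reg

logits = np.array([4.0, 1.0, 0.5, 0.2])
print(regularized_loss(logits, gold=0, mode="label_smoothing"))
print(regularized_loss(logits, gold=0, mode="confidence_penalty"))
```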
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.