Pragmatic Constraint on Distributional Semantics
- URL: http://arxiv.org/abs/2211.11041v1
- Date: Sun, 20 Nov 2022 17:51:06 GMT
- Title: Pragmatic Constraint on Distributional Semantics
- Authors: Elizaveta Zhemchuzhina and Nikolai Filippov and Ivan P. Yamshchikov
- Abstract summary: We show that a Zipf-law token distribution emerges irrespective of the chosen tokenization.
We show that the Zipf distribution is characterized by two distinct groups of tokens that differ both in terms of their frequency and their semantics.
- Score: 6.091096843566857
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the limits of language models' statistical learning in the context of Zipf's law. First, we demonstrate that a Zipf-law token distribution emerges irrespective of the chosen tokenization. Second, we show that the Zipf distribution is characterized by two distinct groups of tokens that differ both in terms of their frequency and their semantics. Namely, tokens that have a one-to-one correspondence with a single semantic concept have different statistical properties than those with semantic ambiguity. Finally, we demonstrate how these properties interfere with statistical learning procedures motivated by distributional semantics.
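As a rough illustration of the first claim (a sketch of my own, not the authors' code; the corpus file is a placeholder), the rank-frequency curve stays approximately log-log linear whether we tokenize into words or into character bigrams:

```python
# Sketch (not the authors' code): estimate the Zipf slope of the
# rank-frequency curve under two different tokenizations of one corpus.
# A slope near -1 on log-log axes is the classic Zipf signature.
from collections import Counter
import math

text = open("corpus.txt", encoding="utf-8").read().lower()  # placeholder path

def word_tokens(s):
    return s.split()                                  # whitespace tokenization

def char_bigrams(s):
    return [s[i:i + 2] for i in range(len(s) - 1)]    # character 2-grams

def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

for name, tok in [("words", word_tokens), ("char bigrams", char_bigrams)]:
    print(name, "Zipf slope ~=", round(zipf_slope(tok(text)), 2))
```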
Related papers
- Are LLMs Models of Distributional Semantics? A Case Study on Quantifiers [14.797001158310092]
We argue that distributional semantics models struggle with truth-conditional reasoning and symbolic processing.
Contrary to expectations, we find that LLMs align more closely with human judgements on exact quantifiers than on vague ones.
arXiv Detail & Related papers (2024-10-17T19:28:35Z)
- The Foundations of Tokenization: Statistical and Computational Concerns [51.370165245628975]
Tokenization is a critical step in the NLP pipeline.
Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood.
The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models.
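As a loose, self-contained illustration of viewing a tokenizer as a formal object (my construction, not the paper's framework), one can model it as an encode/decode pair of maps and check the round-trip property:

```python
# Toy model of a tokenizer as a pair of maps (my construction, not the
# paper's framework): encode: strings -> token sequences, decode: token
# sequences -> strings, with a round-trip exactness check.
TOY_VOCAB = ["low", "er", "new", "est", " "]   # hypothetical merge vocabulary

def encode(s, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation; unknown characters pass through."""
    ranked = sorted(vocab, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(s):
        match = next((v for v in ranked if s.startswith(v, i)), s[i])
        tokens.append(match)
        i += len(match)
    return tokens

def decode(tokens):
    return "".join(tokens)

s = "newest lower"
assert decode(encode(s)) == s      # exactness: decode inverts encode
print(encode(s))                   # ['new', 'est', ' ', 'low', 'er']
```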
arXiv Detail & Related papers (2024-07-16T11:12:28Z)
- Beyond Demographic Parity: Redefining Equal Treatment [23.28973277699437]
We establish the theoretical properties of our notion of equal treatment and devise a two-sample test based on the AUC of an equal treatment inspector.
We release explanationspace, an open-source Python package with methods and tutorials.
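I have not checked the explanationspace API, so the following is only a generic sketch of the underlying idea: train an "inspector" classifier to distinguish the two groups' explanation vectors and read its held-out AUC as a two-sample statistic:

```python
# Generic AUC-based two-sample test (a sketch, not the explanationspace API):
# an inspector classifier tries to tell the groups' explanation vectors apart;
# held-out AUC near 0.5 is consistent with equal treatment.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
expl_a = rng.normal(0.0, 1.0, size=(500, 8))    # explanation vectors, group A
expl_b = rng.normal(0.2, 1.0, size=(500, 8))    # group B, slightly shifted

X = np.vstack([expl_a, expl_b])
y = np.array([0] * len(expl_a) + [1] * len(expl_b))   # group labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

inspector = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, inspector.predict_proba(X_te)[:, 1])
print(f"inspector AUC = {auc:.3f}")   # ~0.5 would suggest equal treatment
```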
arXiv Detail & Related papers (2023-03-14T16:19:44Z)
- Learning versus Refutation in Noninteractive Local Differential Privacy [133.80204506727526]
We study two basic statistical tasks in non-interactive local differential privacy (LDP): learning and refutation.
Our main result is a complete characterization of the sample complexity of PAC learning for non-interactive LDP protocols.
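The result itself is theoretical; to make "non-interactive LDP" concrete, here is the classic randomized-response mechanism (a standard textbook construction, not taken from the paper), in which each user perturbs their bit once and the analyst debiases the aggregate:

```python
# Classic epsilon-LDP randomized response (textbook construction, not from
# the paper): each user reports their true bit with probability
# e^eps / (e^eps + 1), in a single round -- hence non-interactive.
import math
import random

def randomized_response(bit, eps):
    p_truth = math.exp(eps) / (math.exp(eps) + 1)
    return bit if random.random() < p_truth else 1 - bit

def estimate_mean(reports, eps):
    """Debias the aggregate: E[report] = (2p - 1) * mean + (1 - p)."""
    p = math.exp(eps) / (math.exp(eps) + 1)
    return (sum(reports) / len(reports) - (1 - p)) / (2 * p - 1)

random.seed(0)
true_bits = [1] * 700 + [0] * 300                      # true mean is 0.7
reports = [randomized_response(b, eps=1.0) for b in true_bits]
print(round(estimate_mean(reports, eps=1.0), 3))       # close to 0.7
```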
arXiv Detail & Related papers (2022-10-26T03:19:24Z)
- Label Uncertainty Modeling and Prediction for Speech Emotion Recognition using t-Distributions [15.16865739526702]
We propose to model the label distribution using a Student's t-distribution.
We derive the corresponding Kullback-Leibler divergence based loss function and use it to train an estimator for the distribution of emotion labels.
Results reveal that our t-distribution based approach improves over the Gaussian approach and achieves state-of-the-art uncertainty modeling results.
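The paper's loss is a KL divergence between t-distributions; as a simplified stand-in (my substitution, not the authors' derivation), here is a Student-t negative log-likelihood for a predicted location and scale:

```python
# Simplified stand-in (not the paper's KL-based loss): Student-t negative
# log-likelihood of observed labels under a predicted location/scale,
# with fixed degrees of freedom nu.
import numpy as np
from scipy import stats

def t_nll(labels, loc, scale, nu=5.0):
    """Mean negative log-likelihood under Student-t(nu, loc, scale)."""
    return -np.mean(stats.t.logpdf(labels, df=nu, loc=loc, scale=scale))

labels = np.array([0.4, 0.5, 0.9])          # e.g. per-annotator emotion ratings
print(t_nll(labels, loc=0.55, scale=0.2))   # lower means a better fit
```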
arXiv Detail & Related papers (2022-07-25T12:38:20Z)
- Measuring Fairness of Text Classifiers via Prediction Sensitivity [63.56554964580627]
ACCUMULATED PREDICTION SENSITIVITY measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features.
We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness.
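A rough sketch of the general idea (not the paper's exact definition of the metric): accumulate how much the model's output moves under small per-feature perturbations, optionally weighting features by their relevance to a protected attribute:

```python
# Rough sketch of accumulated prediction sensitivity (not the paper's exact
# metric): mean absolute output change per unit perturbation of each feature,
# summed with optional per-feature weights.
import numpy as np

def prediction_sensitivity(predict, X, weights=None, delta=1e-3):
    weights = np.ones(X.shape[1]) if weights is None else weights
    base = predict(X)
    total = 0.0
    for j in range(X.shape[1]):
        X_pert = X.copy()
        X_pert[:, j] += delta
        total += weights[j] * np.mean(np.abs(predict(X_pert) - base)) / delta
    return total

# Toy usage with a linear model: sensitivity ~ sum of |coefficients| = 2.5.
w = np.array([0.5, -2.0, 0.0])
X = np.random.default_rng(0).normal(size=(100, 3))
print(prediction_sensitivity(lambda A: A @ w, X))
```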
arXiv Detail & Related papers (2022-03-16T15:00:33Z)
- Label Distribution Amendment with Emotional Semantic Correlations for Facial Expression Recognition [69.18918567657757]
We propose a new method that amends the label distribution of each facial image by leveraging correlations among expressions in the semantic space.
By comparing the semantic and task class-relation graphs of each image, the method evaluates the confidence of its label distribution.
Experimental results demonstrate that the proposed method is more effective than competing state-of-the-art methods.
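A loose sketch of the graph-comparison idea as I read it (the construction, names, and confidence formula below are my own illustration, not the paper's method):

```python
# Illustration only (my construction, not the paper's method): compare a
# cosine-similarity graph over semantic class embeddings with a relation
# graph implied by one image's label distribution; closer graphs map to
# higher confidence in that label distribution.
import numpy as np

def cosine_graph(vectors):
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return v @ v.T

rng = np.random.default_rng(0)
class_emb = rng.normal(size=(7, 16))        # stand-in semantic embeddings
semantic_graph = cosine_graph(class_emb)    # 7 basic expressions

label_dist = np.array([0.6, 0.2, 0.05, 0.05, 0.05, 0.03, 0.02])
task_graph = np.outer(label_dist, label_dist)   # toy task relation graph

distance = np.linalg.norm(semantic_graph - task_graph)
confidence = np.exp(-distance)              # ad-hoc mapping to (0, 1]
print(round(float(confidence), 3))
```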
arXiv Detail & Related papers (2021-07-23T07:46:14Z)
- Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning [80.05441565830726]
This paper addresses imbalanced semi-supervised learning, where heavily biased pseudo-labels can harm the model performance.
Motivated by this observation, we propose a general pseudo-labeling framework to address the bias.
We term the novel pseudo-labeling framework for imbalanced SSL as Distribution-Aware Semantics-Oriented (DASO) Pseudo-label.
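A schematic sketch of distribution-aware blending (the blend rule and inputs below are placeholders of mine, not DASO's actual formulation): pseudo-labels for rare classes lean more on a semantics-based classifier, frequent classes on the linear one:

```python
# Schematic sketch (placeholder rule, not DASO's actual formulation):
# blend pseudo-labels from a linear and a semantic classifier, leaning on
# the semantic one more heavily for rarer predicted classes.
import numpy as np

def blend_pseudo_labels(p_linear, p_semantic, class_counts):
    rarity = 1.0 - class_counts / class_counts.max()  # 0 = head, ~1 = tail
    k = p_linear.argmax(axis=1)                       # tentative class
    w = rarity[k][:, None]                            # per-sample blend weight
    blended = (1 - w) * p_linear + w * p_semantic
    return blended / blended.sum(axis=1, keepdims=True)

class_counts = np.array([900.0, 90.0, 10.0])          # long-tailed labeled set
p_lin = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
p_sem = np.array([[0.5, 0.3, 0.2], [0.05, 0.15, 0.8]])
print(blend_pseudo_labels(p_lin, p_sem, class_counts))
```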
arXiv Detail & Related papers (2021-06-10T11:58:25Z)
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth, anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
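The paper learns normalizing flows; as a simpler stand-in with the same goal (my substitution, in the spirit of post-hoc whitening), mean-centering plus whitening also maps an anisotropic embedding cloud to zero mean and identity covariance:

```python
# Simpler stand-in for the paper's normalizing flows (my substitution):
# ZCA-style whitening of sentence embeddings to zero mean and identity
# covariance, removing the anisotropy post hoc.
import numpy as np

def whiten(embeddings, eps=1e-8):
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return (embeddings - mu) @ W

# Anisotropic toy cloud: per-dimension scales from 1 to 32.
emb = np.random.default_rng(0).normal(size=(1000, 32)) * np.arange(1, 33)
white = whiten(emb)
print(np.allclose(np.cov(white, rowvar=False), np.eye(32), atol=1e-6))  # True
```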
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
- The empirical structure of word frequency distributions [0.0]
I show that first names form natural communicative distributions in most languages.
I then show that this pattern of findings replicates in the communicative distributions of English nouns and verbs.
arXiv Detail & Related papers (2020-01-09T20:52:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.