Universality and diversity in word patterns
- URL: http://arxiv.org/abs/2208.11175v1
- Date: Tue, 23 Aug 2022 20:03:27 GMT
- Title: Universality and diversity in word patterns
- Authors: David Sánchez and Luciano Zunino and Juan De Gregorio and Raúl Toral and Claudio Mirasso
- Abstract summary: We present an analysis of lexical statistical connections for eleven major languages.
We find that the diverse ways in which languages express word relations give rise to unique pattern distributions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Words are fundamental linguistic units that connect thoughts and things
through meaning. However, words do not appear independently in a text sequence.
The existence of syntactic rules induces correlations among neighboring words.
Further, words are not evenly distributed but approximately follow a power law,
since terms with purely semantic content appear much less often than terms that
specify grammatical relations. Using an ordinal pattern approach, we present an
analysis of lexical statistical connections for eleven major languages. We find
that the diverse ways in which languages express word relations give rise to
unique pattern distributions. Remarkably, we find that these relations
can be modeled with a Markov model of order 2 and that this result is
universally valid for all the studied languages. Furthermore, fluctuations of
the pattern distributions allow us to determine the historical period in which
a text was written and to identify its author. Taken together, these results emphasize
the relevance of time series analysis and information-theoretic methods for the
understanding of statistical correlations in natural languages.
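As a rough illustration of the ordinal pattern approach, the sketch below maps each word to the frequency rank of its type and tallies the ordinal patterns found in overlapping windows of the resulting rank series. The window length D = 3, the rank mapping, and the tie-breaking convention are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch of an ordinal-pattern analysis of a word sequence.
# Assumptions (not from the paper): words are mapped to the frequency
# ranks of their types, and ordinal patterns of length D = 3 are read
# off overlapping windows; ties are broken by order of appearance.
from collections import Counter
from itertools import permutations

def word_ranks(tokens):
    """Map each token to the frequency rank of its type (1 = most frequent)."""
    counts = Counter(tokens)
    rank = {w: r for r, (w, _) in enumerate(counts.most_common(), start=1)}
    return [rank[t] for t in tokens]

def ordinal_pattern(window):
    """Return the permutation that sorts the window (ties broken by position)."""
    return tuple(sorted(range(len(window)), key=lambda i: (window[i], i)))

def pattern_distribution(series, D=3):
    """Relative frequency of each length-D ordinal pattern in the series."""
    pats = Counter(ordinal_pattern(series[i:i + D])
                   for i in range(len(series) - D + 1))
    total = sum(pats.values())
    return {p: pats.get(p, 0) / total for p in permutations(range(D))}

tokens = "the cat sat on the mat and the dog sat on the rug".split()
dist = pattern_distribution(word_ranks(tokens), D=3)
for p, f in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(p, round(f, 3))
```

Under the paper's order-2 Markov finding, the statistics of such patterns would be reproduced by a model in which each word's rank depends only on the two preceding ranks.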
Related papers
- Entropy and type-token ratio in gigaword corpora
We investigate entropy and type-token ratio, two metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish.
We find a functional relation between entropy and type-token ratio that holds across the corpora under consideration.
Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
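For concreteness, here is a minimal sketch, independent of the cited paper's code, of the two lexical-diversity metrics compared in this entry: the Shannon entropy of the empirical word-frequency distribution and the type-token ratio.

```python
# Minimal sketch (an illustration, not the cited paper's code) of the two
# lexical-diversity metrics: Shannon entropy of the word-frequency
# distribution and the type-token ratio.
import math
from collections import Counter

def shannon_entropy(tokens):
    """Entropy (in bits) of the empirical word-frequency distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

tokens = "to be or not to be that is the question".split()
print(f"H = {shannon_entropy(tokens):.3f} bits, TTR = {type_token_ratio(tokens):.3f}")
```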
arXiv Detail & Related papers (2024-11-15T14:40:59Z)
- Investigating Idiomaticity in Word Representations
We focus on noun compounds of varying levels of idiomaticity in two languages (English and Portuguese).
We present a dataset of minimal pairs containing human idiomaticity judgments for each noun compound at both type and token levels.
We define a set of fine-grained metrics of Affinity and Scaled Similarity to determine how sensitive the models are to perturbations that may lead to changes in idiomaticity.
arXiv Detail & Related papers (2024-11-04T21:05:01Z)
- Complex systems approach to natural language
This review summarizes the main methodological concepts used in studying natural language from the perspective of complexity science.
Three main complexity-related research trends in quantitative linguistics are covered.
arXiv Detail & Related papers (2024-01-05T12:01:26Z)
- Autocorrelations Decay in Texts and Applicability Limits of Language Models
We empirically demonstrate that autocorrelations of words in texts decay according to a power law.
We show that distributional semantics provides coherent autocorrelation decay exponents for texts translated into multiple languages.
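A sketch of how the decay claim could be probed on a token sequence follows; the occurrence-series construction, the choice of word, and the toy corpus are illustrative assumptions rather than the paper's procedure.

```python
# Illustrative sketch (construction is mine, not the paper's): estimate the
# autocorrelation of a binary occurrence series for one word at increasing
# lags. On real text, a power-law decay C(t) ~ t**(-alpha) would appear as
# a straight line in a log-log plot of C(t) against t.

def autocorrelation(series, lag):
    """Sample autocorrelation of a numeric series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag)) / (n - lag)
    return cov / var if var else 0.0

# Toy corpus; substitute any long text to see a meaningful decay curve.
tokens = ("the quick brown fox jumps over the lazy dog " * 200).split()
occurrence = [1.0 if t == "the" else 0.0 for t in tokens]
for lag in (1, 2, 4, 8, 16, 32, 64):
    print(lag, round(autocorrelation(occurrence, lag), 5))
```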
arXiv Detail & Related papers (2023-05-11T07:23:01Z)
- Transparency Helps Reveal When Language Models Learn Meaning
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- Patterns of Lexical Ambiguity in Contextualised Language Models
We introduce an extended, human-annotated dataset of graded word sense similarity and co-predication.
Both types of human judgements indicate that the similarity of polysemic interpretations lies on a continuum between identity of meaning and homonymy.
Our dataset appears to capture a substantial part of the complexity of lexical ambiguity, and can provide a realistic test bed for contextualised embeddings.
arXiv Detail & Related papers (2021-09-27T13:11:44Z)
- Language Model Evaluation Beyond Perplexity
We analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained.
We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions.
arXiv Detail & Related papers (2021-05-31T20:13:44Z)
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)
- Speakers Fill Lexical Semantic Gaps with Context
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
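The entropy-of-meanings operationalisation fits in a few lines; the sense probabilities below are hypothetical placeholders, whereas the paper estimates meaning distributions from contexts.

```python
# Minimal sketch of lexical ambiguity as the entropy of a word's meaning
# distribution. The probabilities are hypothetical placeholders, not the
# paper's context-based estimates.
import math

def ambiguity(sense_probs):
    """Shannon entropy (in bits) of a word's distribution over meanings."""
    return -sum(p * math.log2(p) for p in sense_probs if p > 0)

# A word dominated by one sense is less ambiguous than an evenly split one.
print(ambiguity([0.9, 0.05, 0.05]))  # low entropy: mostly one meaning
print(ambiguity([1/3, 1/3, 1/3]))    # high entropy: maximally ambiguous
```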
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- Multiplex Word Embeddings for Selectional Preference Acquisition
We propose a multiplex word embedding model, which can be easily extended according to various relations among words.
Our model can effectively distinguish words with respect to different relations without introducing unnecessary sparseness.
arXiv Detail & Related papers (2020-01-09T04:47:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.