How (Non-)Optimal is the Lexicon?
- URL: http://arxiv.org/abs/2104.14279v2
- Date: Fri, 30 Apr 2021 19:46:59 GMT
- Title: How (Non-)Optimal is the Lexicon?
- Authors: Tiago Pimentel, Irene Nikkarinen, Kyle Mahowald, Ryan Cotterell,
Damián Blasi
- Abstract summary: We take a coding-theoretic view of the lexicon and make use of a novel generative statistical model.
Examining corpora from 7 typologically diverse languages, we use those upper bounds to quantify the lexicon's optimality.
We find that (compositional) morphology and graphotactics can sufficiently account for most of the complexity of natural codes.
- Score: 35.91590073820011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The mapping of lexical meanings to wordforms is a major feature of natural
languages. While usage pressures might assign short words to frequent meanings
(Zipf's law of abbreviation), the need for a productive and open-ended
vocabulary, local constraints on sequences of symbols, and various other
factors all shape the lexicons of the world's languages. Despite their
importance in shaping lexical structure, the relative contributions of these
factors have not been fully quantified. Taking a coding-theoretic view of the
lexicon and making use of a novel generative statistical model, we define upper
bounds for the compressibility of the lexicon under various constraints.
Examining corpora from 7 typologically diverse languages, we use those upper
bounds to quantify the lexicon's optimality and to explore the relative costs
of major constraints on natural codes. We find that (compositional) morphology
and graphotactics can sufficiently account for most of the complexity of
natural codes -- as measured by code length.
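The comparison at the heart of the abstract can be pictured with a toy permutation baseline: hold the set of wordforms fixed, reassign them so that the shortest forms encode the most frequent meanings (Zipf's law of abbreviation taken to its limit), and measure how far the attested assignment falls from that optimum. The sketch below illustrates only this general idea, with invented frequencies; the paper itself derives its upper bounds from a generative character-level model.

```python
def weighted_code_length(freqs, lengths):
    """Frequency-weighted average wordform length (symbols per use)."""
    total = sum(freqs)
    return sum(f * l for f, l in zip(freqs, lengths)) / total

# Toy lexicon: (wordform, usage frequency). All numbers are invented.
lexicon = [("the", 5000), ("information", 40), ("of", 4500),
           ("probability", 15), ("a", 4800), ("lexicon", 5)]
forms = [w for w, _ in lexicon]
freqs = [f for _, f in lexicon]

# Cost of the attested code: each meaning keeps its actual wordform.
actual = weighted_code_length(freqs, [len(w) for w in forms])

# Optimal permutation of the *same* wordforms: the shortest forms are
# reassigned to the most frequent meanings.
optimal = weighted_code_length(sorted(freqs, reverse=True),
                               sorted(len(w) for w in forms))

print(f"actual cost:  {actual:.3f} symbols/use")
print(f"optimal cost: {optimal:.3f} symbols/use")
print(f"excess cost:  {100 * (actual / optimal - 1):.1f}%")
```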
Related papers
- Entropy and type-token ratio in gigaword corpora [0.0]
We investigate entropy and type-token ratio, two metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish.
We find a functional relation between entropy and type-token ratio that holds across the corpora under consideration.
Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
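Both metrics are simple to compute; here is a minimal sketch on a toy token list (the paper works at gigaword scale, where both quantities vary systematically with corpus size):

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy (bits per token) of the unigram distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def type_token_ratio(tokens):
    """Number of distinct types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(f"entropy: {unigram_entropy(tokens):.3f} bits/token")
print(f"TTR:     {type_token_ratio(tokens):.3f}")
```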
arXiv Detail & Related papers (2024-11-15T14:40:59Z)
- Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM)
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
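One way to picture the invariance property is via random relabeling of the vocabulary, sketched below on a toy integer vocabulary; note this illustrates the constraint only, not the paper's construction, which draws fresh random token embeddings per sequence.

```python
import random

def relabel_sequence(token_ids, vocab_size, seed=None):
    """Apply a fresh random permutation of the vocabulary to one sequence.

    A lexinvariant LM assigns the same probability to a sequence and to any
    consistent relabeling of it, so fixed token identities carry no
    information; only within-sequence co-occurrence patterns do.
    """
    rng = random.Random(seed)
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return [perm[t] for t in token_ids]

seq = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
print(relabel_sequence(seq, vocab_size=10, seed=0))
# Repeated tokens stay repeated after relabeling; identities are erased.
```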
arXiv Detail & Related papers (2023-05-24T19:10:46Z)
- Representing Interlingual Meaning in Lexical Databases [5.654039329474587]
We show that existing lexical databases have structural limitations that reduce their expressivity for culturally specific words.
In particular, the lexical meaning space of dominant languages, such as English, is represented more accurately, while linguistically or culturally diverse languages are mapped only approximately.
arXiv Detail & Related papers (2023-01-22T17:41:29Z)
- Local Grammar-Based Coding Revisited [0.0]
In minimal local grammar-based coding, the input string is represented as the grammar that minimizes the length of the encoded output.
We invoke a simple harmonic bound on ranked probabilities, reminiscent of Zipf's law.
We refine known bounds on the vocabulary size, showing its partial power-law equivalence with mutual information and redundancy.
We analyze grammar-based codes whose finite vocabularies are empirical rank lists, proving that such codes are also universal.
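Grammar-based coding is easiest to see in its greedy Re-Pair-style form: repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal until no pair repeats. The sketch below illustrates the general technique only; the minimal coding studied in the paper optimizes the output length globally rather than greedily.

```python
from collections import Counter

def greedy_grammar(seq):
    """Greedy Re-Pair-style grammar construction over a symbol sequence."""
    seq, rules, next_id = list(seq), {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs or pairs.most_common(1)[0][1] < 2:
            break  # no adjacent pair occurs twice; the grammar is done
        pair = pairs.most_common(1)[0][0]
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):  # replace non-overlapping occurrences left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

start, rules = greedy_grammar("abracadabra abracadabra")
print("start symbol string:", start)
for lhs, (a, b) in rules.items():
    print(f"{lhs} -> {a} {b}")
```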
arXiv Detail & Related papers (2022-09-27T19:05:22Z)
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection [62.071938098215085]
We focus on the CommonGen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
arXiv Detail & Related papers (2020-12-19T23:23:40Z)
- Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
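The operationalisation is Shannon entropy over a word's meaning distribution; a minimal sketch with invented sense frequencies follows (the paper estimates such quantities from contextualised models rather than hand counts):

```python
import math

def ambiguity(sense_counts):
    """Lexical ambiguity as the entropy (bits) of the sense distribution."""
    total = sum(sense_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in sense_counts.values() if c > 0)

# Hypothetical sense frequencies for two ambiguous words.
bank = {"financial institution": 70, "river edge": 25, "tilt (aviation)": 5}
bat = {"animal": 50, "sports implement": 50}

print(f"ambiguity('bank') = {ambiguity(bank):.3f} bits")
print(f"ambiguity('bat')  = {ambiguity(bat):.3f} bits")
```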
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
- A Broad-Coverage Deep Semantic Lexicon for Verbs [3.219005794369446]
COLLIE-V is a deep lexical resource for verbs with the coverage of WordNet and semantic details that meet or exceed existing resources.
New ontological concepts and lexical entries, together with semantic role preferences and entailment axioms, are automatically derived.
arXiv Detail & Related papers (2020-07-06T12:03:14Z)
- Neural Polysynthetic Language Modelling [15.257624461339867]
In high-resource languages, a common approach is to treat morphologically-distinct variants of a common root as completely independent word types.
This assumes that there are a limited number of inflections per root, and that the majority of them will appear in a sufficiently large corpus.
We examine the current state-of-the-art in language modelling, machine translation, and text prediction for four polysynthetic languages.
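The cost of that assumption is easy to see: a word-level vocabulary grows multiplicatively in roots times affixes, whereas a segmented vocabulary grows additively. A toy sketch with invented agglutinative forms (real polysynthetic paradigms are vastly larger):

```python
from itertools import product

roots = ["talo", "auto", "kissa", "koira"]             # invented roots
suffixes = ["", "n", "ssa", "sta", "lla", "lle", "t"]  # invented endings

# Every root combines with every suffix.
full_forms = {r + s for r, s in product(roots, suffixes)}
subword_units = set(roots) | {s for s in suffixes if s}

print(f"word-level vocabulary:  {len(full_forms)} types")     # 4 * 7 = 28
print(f"root+suffix vocabulary: {len(subword_units)} units")  # 4 + 6 = 10
```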
arXiv Detail & Related papers (2020-05-11T22:57:04Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
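A minimal PyTorch sketch of such a predictor is below; the layer sizes, kernel width, and the treatment of WALS features as independent binary labels are all simplifying assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TypologyPredictor(nn.Module):
    """Byte embeddings -> 1D convolution -> GRU -> per-feature logits."""

    def __init__(self, n_features=192, emb_dim=64, conv_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(256, emb_dim)        # one vector per byte value
        self.conv = nn.Conv1d(emb_dim, conv_dim, kernel_size=5, padding=2)
        self.rnn = nn.GRU(conv_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, n_features)      # one logit per feature

    def forward(self, byte_ids):                       # (batch, seq_len)
        x = self.embed(byte_ids)                       # (batch, seq, emb)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # convolve over the byte axis
        _, h = self.rnn(x.transpose(1, 2))             # final hidden state
        return self.out(h[-1])                         # (batch, n_features)

model = TypologyPredictor()
batch = torch.tensor([list("Hyvää päivää".encode("utf-8"))])  # raw bytes as input
print(model(batch).shape)                              # torch.Size([1, 192])
```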
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
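Benchmarks of this kind are typically consumed by scoring each concept pair with a model and correlating against the human ratings; a minimal sketch with invented pairs and stand-in model scores:

```python
from scipy.stats import spearmanr

# Hypothetical slice of a Multi-SimLex-style dataset: concept pairs with
# human similarity ratings (the real benchmark has 1,888 pairs per language).
pairs = [("cup", "mug", 5.4), ("car", "automobile", 5.8),
         ("car", "bicycle", 3.1), ("happy", "cheerful", 5.2),
         ("happy", "sad", 0.7), ("tree", "forest", 3.5)]

# Stand-in model scores; in practice these would be cosine similarities
# between the embeddings of each pair.
model_scores = [0.81, 0.93, 0.44, 0.78, 0.20, 0.51]

human = [rating for _, _, rating in pairs]
rho, p = spearmanr(model_scores, human)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```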
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.