Evolution of Efficient Symbolic Communication Codes
- URL: http://arxiv.org/abs/2306.02383v2
- Date: Sun, 11 Jun 2023 06:38:03 GMT
- Title: Evolution of Efficient Symbolic Communication Codes
- Authors: Anton Kolonin
- Abstract summary: The paper explores how the structure of human natural language can be seen as a product of the evolution of an inter-personal communication code.
It aims to maximise culture-agnostic and cross-lingual metrics such as anti-entropy, compression factor and cross-split F1 score.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The paper explores how the structure of human natural language can be seen as a product of the evolution of an inter-personal communication code, targeting maximisation of culture-agnostic and cross-lingual metrics such as anti-entropy, compression factor and cross-split F1 score. The exploration is done as part of a larger unsupervised language learning effort, in which an attempt is made to perform meta-learning in a space of hyper-parameters, maximising the F1 score based on the "ground truth" language structure by means of maximising the metrics mentioned above. The paper presents preliminary results of a cross-lingual word-level segmentation (tokenisation) study for Russian, Chinese and English, as well as a subword segmentation (morphological parsing) study for English. It is found that language structure, in the form of word-level segmentation or tokenisation, can be seen as driven by all of these metrics, with anti-entropy being more relevant to English and Russian while compression factor is more specific to Chinese. The study of subword segmentation or morphological parsing on the English lexicon reveals that parsing quality is directly associated with compression factor, while, surprisingly, its connection with anti-entropy turns out to be the inverse.
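The abstract names three guiding metrics but does not define them. The sketch below shows one plausible way such quantities could be computed for a candidate tokenisation, assuming anti-entropy as the complement of normalised Shannon entropy over token frequencies, compression factor as characters per token, and cross-split F1 as boundary-level F1 between two segmentations of the same text. All three definitions are illustrative assumptions, not the paper's exact formulas.

```python
from collections import Counter
import math

def anti_entropy(tokens):
    """Assumed definition: 1 - H / H_max, where H is the Shannon entropy of
    token frequencies and H_max = log2(vocabulary size)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    h_max = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return 1.0 - h / h_max

def compression_factor(text, tokens):
    """Assumed definition: characters in the raw text divided by the number
    of tokens it is segmented into (larger = more compact code)."""
    return len(text) / max(len(tokens), 1)

def boundary_set(tokens):
    """Character offsets of the token boundaries implied by a segmentation."""
    bounds, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        bounds.add(pos)
    return bounds

def cross_split_f1(tokens_a, tokens_b):
    """Assumed definition: boundary-level F1 between two segmentations of the
    same text, e.g. produced by models trained on different corpus splits."""
    a, b = boundary_set(tokens_a), boundary_set(tokens_b)
    tp = len(a & b)
    precision = tp / len(a) if a else 0.0
    recall = tp / len(b) if b else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy usage on a single unsegmented string
text = "thecatsatonthemat"
seg_a = ["the", "cat", "sat", "on", "the", "mat"]
seg_b = ["the", "cats", "at", "on", "the", "mat"]
print(anti_entropy(seg_a), compression_factor(text, seg_a), cross_split_f1(seg_a, seg_b))
```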
Related papers
- Parsing Through Boundaries in Chinese Word Segmentation [4.74872130711676]
Unlike English, Chinese lacks explicit word boundaries, making segmentation both necessary and inherently ambiguous.
This study highlights the intricate relationship between word segmentation and syntactic parsing, providing a clearer understanding of how different segmentation strategies shape dependency structures in Chinese.
arXiv Detail & Related papers (2025-03-29T14:24:02Z) - Entropy and type-token ratio in gigaword corpora [0.0]
We investigate entropy and type-token ratio, two metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish.
We find a functional relation between entropy and type-token ratio that holds across the corpora under consideration.
Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
arXiv Detail & Related papers (2024-11-15T14:40:59Z) - Linguistic Structure from a Bottleneck on Sequential Information Processing [5.850665541267672]
We show that natural-language-like systematicity arises in codes that are constrained by predictive information.
We show that human languages are structured to have low predictive information at the levels of phonology, morphology, syntax, and semantics.
arXiv Detail & Related papers (2024-05-20T15:25:18Z) - How Important Is Tokenization in French Medical Masked Language Models? [7.866517623371908]
Subword tokenization has become the prevailing standard in the field of natural language processing (NLP).
This paper seeks to delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks.
We introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.
arXiv Detail & Related papers (2024-02-22T23:11:08Z) - Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as four newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z) - VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model VECO2.0 based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - Self-tuning hyper-parameters for unsupervised cross-lingual tokenization [0.0]
We implement a meta-learning approach for automatic determination of hyper-parameters of the unsupervised tokenization model.
We find a fairly good correlation between the F1 score and the additive combination of the unsupervised metrics for English and Russian. (A toy sketch of this tuning loop appears after this list.)
In the case of Chinese, we find a significant correlation between the F1 score and the compression factor.
arXiv Detail & Related papers (2023-03-04T14:23:02Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and
Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z) - Constructing a Family Tree of Ten Indo-European Languages with
Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.