Mechanism of Evolution Shared by Gene and Language
- URL: http://arxiv.org/abs/2012.14309v1
- Date: Mon, 28 Dec 2020 15:46:19 GMT
- Title: Mechanism of Evolution Shared by Gene and Language
- Authors: Li-Min Wang, Hsing-Yi Lai, Sun-Ting Tsai, Shan-Jyun Wu, Meng-Xue Tsai,
Daw-Wei Wang, Yi-Ching Su, Chen Siang Ng, and Tzay-Ming Hong
- Abstract summary: We propose a general mechanism for evolution to explain the diversity of gene and language.
We find that the classical correspondence, "domain plays the role of word in gene language", is not rigorous.
We devise a new evolution unit, syllgram, to include the characteristics of spoken and written language.
- Score: 8.882751635947027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a general mechanism for evolution to explain the diversity of gene
and language. To quantify their common features and reveal the hidden
structures, several statistical properties and patterns are examined based on a
new method called the rank-rank analysis. We find that the classical
correspondence, "domain plays the role of word in gene language", is not
rigorous, and propose to replace domain by protein. In addition, we devise a
new evolution unit, syllgram, to include the characteristics of spoken and
written language. Based on the correspondence between (protein, domain) and
(word, syllgram), we discover that both gene and language share a common
scaling structure and scale-free network. Like the Rosetta stone, this work may
help decipher the secret behind non-coding DNA and unknown languages.
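As a concrete illustration of the rank-rank analysis named in the abstract, here is a minimal sketch. The `syllables()` helper is a crude, hypothetical stand-in for the paper's syllgram segmentation, and the toy input stands in for a real corpus.

```python
# Minimal sketch of a rank-rank analysis: pair the frequency rank of each
# word with the frequency ranks of its parts. The syllables() helper is a
# hypothetical stand-in for the paper's syllgram segmentation.
from collections import Counter
import re

def syllables(word):
    # Crude vowel-group splitter, for illustration only.
    return re.findall(r"[^aeiou]*[aeiou]+", word) or [word]

def rank_rank_pairs(text):
    words = re.findall(r"[a-z]+", text.lower())
    word_freq = Counter(words)
    syll_freq = Counter(s for w in words for s in syllables(w))
    # Rank 1 = most frequent item.
    word_rank = {w: r for r, (w, _) in enumerate(word_freq.most_common(), 1)}
    syll_rank = {s: r for r, (s, _) in enumerate(syll_freq.most_common(), 1)}
    # One (word rank, syllable rank) point per word-part pairing; the paper
    # plots such points and fits the visible lines to a master curve.
    return [(word_rank[w], syll_rank[s]) for w in word_freq for s in syllables(w)]

print(rank_rank_pairs("the cat and the hat sat on the mat")[:5])
```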
Related papers
- Word length predicts word order: "Min-max"-ing drives language evolution [0.0]
This paper proposes a universal underlying mechanism for word order change based on a large tagged parallel dataset of over 1,500 languages.
Findings suggest an integrated "Min-Max" theory of language evolution driven by competing pressures of processing and information structure.
arXiv Detail & Related papers (2025-05-20T04:25:55Z)
- GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations.
In this work, we leverage pre-trained large language models and DNA sequence models to extract features from gene descriptions and DNA sequence data.
We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
- Can linguists better understand DNA? [0.0]
This study investigates whether capabilities transfer between natural language and gene sequences/languages.
We constructed two analogous tasks: DNA-pair classification (DNA sequence similarity) and DNA-protein-pair classification (gene coding determination).
These tasks were designed to validate the transferability of capabilities from natural language to gene sequences.
arXiv Detail & Related papers (2024-12-10T17:06:33Z)
- Entropy and type-token ratio in gigaword corpora [0.0]
We investigate entropy and type-token ratio, two metrics of lexical diversity, in six massive linguistic datasets in English, Spanish, and Turkish.
We find a functional relation between entropy and type-token ratio that holds across the corpora under consideration.
Our results contribute to the theoretical understanding of text structure and offer practical implications for fields like natural language processing.
arXiv Detail & Related papers (2024-11-15T14:40:59Z)
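The two metrics in the entry above have direct textbook definitions. Here is a minimal sketch of both, assuming simple whitespace tokenization; the paper works at gigaword scale with more careful preprocessing.

```python
# Minimal sketch of the two lexical-diversity metrics, assuming simple
# whitespace tokenization; the paper uses gigaword-scale corpora.
import math
from collections import Counter

def entropy_and_ttr(tokens):
    freq = Counter(tokens)
    n = len(tokens)
    # Shannon entropy of the unigram distribution, in bits:
    # H = -sum_w p(w) * log2 p(w)
    entropy = -sum((c / n) * math.log2(c / n) for c in freq.values())
    # Type-token ratio: distinct word types over total tokens.
    ttr = len(freq) / n
    return entropy, ttr

print(entropy_and_ttr("the cat sat on the mat".split()))  # (~2.25, 5/6)
```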
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers.
It is common to instead use proxy tasks that are similar in only an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
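The core of vector-quantized tokenization in the entry above is nearest-codebook assignment. A minimal sketch, assuming a fixed random codebook; VQDNA itself learns the codebook end-to-end, which is not shown here.

```python
# Minimal sketch of vector-quantized tokenization, assuming a fixed codebook;
# VQDNA learns its codebook end-to-end, which is not shown here.
import numpy as np

def vq_tokenize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector."""
    # embeddings: (n, d); codebook: (k, d); squared distances: (n, k)
    d2 = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # one discrete token id per position

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))       # 16 "vocabulary" vectors
frame_embs = rng.normal(size=(5, 8))      # stand-in for encoded genome windows
print(vq_tokenize(frame_embs, codebook))  # five token ids in [0, 16)
```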
- A single-cell gene expression language model [2.9112649816695213]
We propose a machine learning system to learn context dependencies between genes.
Our model, Exceiver, is trained across a diversity of cell types using a self-supervised task.
We found agreement between the similarity profiles of latent sample representations and learned gene embeddings with respect to biological annotations.
arXiv Detail & Related papers (2022-10-25T20:52:19Z)
- Crosslinguistic word order variation reflects evolutionary pressures of dependency and information locality [4.869029215261254]
About 40% of the world's languages have subject-verb-object order, and about 40% have subject-object-verb order.
We show that variation in word order reflects different ways of balancing competing pressures of dependency locality and information locality.
Our findings suggest that syntactic structure and usage across languages co-adapt to support efficient communication under limited cognitive resources.
arXiv Detail & Related papers (2022-06-09T02:56:53Z)
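Dependency locality, one of the pressures named in the entry above, can be made concrete as total dependency length: the sum of linear distances between each word and its syntactic head. A minimal sketch, assuming a head-index encoding of the parse; the example sentence and indices are hypothetical.

```python
# Minimal sketch of dependency locality as total dependency length,
# assuming a head-index encoding (head[i] = position of word i's head,
# -1 for the root); the sentence and parse are hypothetical.
def total_dependency_length(head):
    return sum(abs(i - h) for i, h in enumerate(head) if h >= 0)

# "the dog chased the cat" with the verb at index 2 as root:
svo_heads = [1, 2, -1, 4, 2]
print(total_dependency_length(svo_heads))  # 5; shorter = more local
```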
- SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model [3.0643865202019698]
We propose a new solution named SemanticCAP to identify accessible regions of the genome.
It introduces a gene language model that captures the context of gene sequences, providing an effective representation of them.
Compared with other systems on public benchmarks, our model achieves better performance.
arXiv Detail & Related papers (2022-04-05T11:47:58Z)
- Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses [62.197912623223964]
We show a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings.
We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI.
This suggests that the embedding captures some part of the brain's natural language representation structure.
arXiv Detail & Related papers (2021-06-09T22:59:12Z)
- Self-organizing Pattern in Multilayer Network for Words and Syllables [17.69876273827734]
We propose a new universal law that highlights the equally important role of syllables.
By plotting rank-rank frequency distribution of word and syllable for English and Chinese corpora, visible lines appear and can be fit to a master curve.
arXiv Detail & Related papers (2020-05-05T12:01:47Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
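Singular vector canonical correlation analysis (SVCCA), used in the entry above, reduces to two SVDs followed by one more on the whitened views. A minimal numpy sketch under assumed inputs; the random matrices stand in for the paper's language representations.

```python
# Minimal SVCCA sketch, assuming two feature matrices with the same rows
# (e.g. one row per language); random data stands in for real representations.
import numpy as np

def svcca(X, Y, k=10):
    # Step 1: SVD each centered view; keep the top-k singular directions.
    Ux, _, _ = np.linalg.svd(X - X.mean(0), full_matrices=False)
    Uy, _, _ = np.linalg.svd(Y - Y.mean(0), full_matrices=False)
    Xk, Yk = Ux[:, :k], Uy[:, :k]  # orthonormal columns, i.e. already whitened
    # Step 2: canonical correlations = singular values of Xk^T Yk.
    return np.linalg.svd(Xk.T @ Yk, compute_uv=False)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))                       # view 1 (e.g. typology)
Y = X @ rng.normal(size=(32, 24)) + 0.1 * rng.normal(size=(100, 24))
print(svcca(X, Y, k=8)[:3])                          # near 1 for shared signal
```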
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- Geospatial distributions reflect rates of evolution of features of language [0.0]
We propose a model-based approach to the problem through the analysis of language change as a process combining vertical descent, spatial interactions, and mutations in both dimensions.
A notion of linguistic temperature emerges naturally from this analysis as a dimensionless measure of the propensity of a linguistic feature to undergo change.
We demonstrate how temperatures of linguistic features can be inferred from their present-day geospatial distributions, without recourse to information about their phylogenies.
arXiv Detail & Related papers (2018-01-29T17:24:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.