Self-organizing Pattern in Multilayer Network for Words and Syllables
- URL: http://arxiv.org/abs/2005.02087v1
- Date: Tue, 5 May 2020 12:01:47 GMT
- Title: Self-organizing Pattern in Multilayer Network for Words and Syllables
- Authors: Li-Min Wang, Sun-Ting Tsai, Shan-Jyun Wu, Meng-Xue Tsai, Daw-Wei Wang,
Yi-Ching Su, and Tzay-Ming Hong
- Abstract summary: We propose a new universal law that highlights the equally important role of syllables.
By plotting rank-rank frequency distribution of word and syllable for English and Chinese corpora, visible lines appear and can be fit to a master curve.
- Score: 17.69876273827734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the ultimate goals for linguists is to find universal properties in
human languages. Although words are generally considered as representing
arbitrary mapping between linguistic forms and meanings, we propose a new
universal law that highlights the equally important role of syllables, which is
complementary to Zipf's. By plotting rank-rank frequency distribution of word
and syllable for English and Chinese corpora, visible lines appear and can be
fit to a master curve. We discover that the multilayer network for words and
syllables built from this analysis exhibits self-organization, which relies
heavily on the inclusion of syllables and their connections. An analytic form
for the scaling structure is derived and used to quantify how Internet slang
becomes fashionable, demonstrating its usefulness as a new tool for
evolutionary linguistics.
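The rank-rank analysis described in the abstract can be sketched in a few lines: rank words and syllables separately by frequency, then collect one (word rank, syllable rank) point for each word/syllable pairing. The sketch below is illustrative only; the naive vowel-group syllabifier and the toy corpus are assumptions for demonstration, not the paper's method (which uses proper English and Chinese syllabification).

```python
from collections import Counter
import re

def syllabify(word):
    """Naive syllabifier: split on vowel groups, attaching word-final
    consonants to the last syllable. Illustration only; real analyses
    use dictionary- or rule-based syllabification."""
    return re.findall(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]+$)?", word) or [word]

def rank_map(counter):
    """Map each item to its frequency rank (1 = most frequent)."""
    return {item: r for r, (item, _) in enumerate(counter.most_common(), start=1)}

def rank_rank_pairs(corpus):
    """Return sorted (word rank, syllable rank) points for a corpus;
    plotting these is what reveals the 'visible lines' in the paper."""
    words = corpus.lower().split()
    word_freq = Counter(words)
    syll_freq = Counter(s for w in words for s in syllabify(w))
    wr, sr = rank_map(word_freq), rank_map(syll_freq)
    return sorted({(wr[w], sr[s]) for w in word_freq for s in syllabify(w)})
```

On a real corpus, scatter-plotting these pairs on log axes would be the starting point for fitting the master curve the abstract describes.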
Related papers
- Revisiting Syllables in Language Modelling and their Application on Low-Resource Machine Translation [1.2617078020344619]
Syllables provide shorter sequences than characters, require less-specialised extracting rules than morphemes, and their segmentation is not impacted by the corpus size.
We first explore the potential of syllables for open-vocabulary language modelling in 21 languages.
We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy.
arXiv Detail & Related papers (2022-10-05T18:55:52Z)
- Latent Topology Induction for Understanding Contextualized Representations [84.7918739062235]
We study the representation space of contextualized embeddings and gain insight into the hidden topology of large language models.
We show there exists a network of latent states that summarize linguistic properties of contextualized representations.
arXiv Detail & Related papers (2022-06-03T11:22:48Z)
- Feature-rich multiplex lexical networks reveal mental strategies of early language learning [0.7111443975103329]
We introduce FEature-Rich MUltiplex LEXical (FERMULEX) networks.
Similarities model heterogeneous word associations across semantic/syntactic/phonological aspects of knowledge.
Words are enriched with multi-dimensional feature embeddings including frequency, age of acquisition, length and polysemy.
arXiv Detail & Related papers (2022-01-13T16:44:51Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features [0.0]
We provide large poetry corpora for English and German, and annotate prosodic features in smaller corpora to train corpus-driven neural models.
We show that BiLSTM-CRF models with syllable embeddings outperform a CRF baseline and different BERT-based approaches.
arXiv Detail & Related papers (2021-02-17T16:38:57Z)
- Seeing Both the Forest and the Trees: Multi-head Attention for Joint Classification on Different Compositional Levels [15.453888735879525]
In natural languages, words are used in association to construct sentences.
We design a deep neural network architecture that explicitly wires lower and higher linguistic components.
We show that our model, MHAL, learns to simultaneously solve them at different levels of granularity.
arXiv Detail & Related papers (2020-11-01T10:44:46Z)
- Revisiting Neural Language Modelling with Syllables [3.198144010381572]
We reconsider syllables for an open-vocabulary generation task in 20 languages.
We use rule-based syllabification methods for five languages and address the rest with a hyphenation tool.
With a comparable perplexity, we show that syllables outperform characters, annotated morphemes and unsupervised subwords.
arXiv Detail & Related papers (2020-10-24T11:44:41Z)
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
- Where New Words Are Born: Distributional Semantic Analysis of Neologisms and Their Semantic Neighborhoods [51.34667808471513]
We investigate the importance of two factors, semantic sparsity and frequency growth rates of semantic neighbors, formalized in the distributional semantics paradigm.
We show that both factors are predictive of word emergence, although we find more support for the latter hypothesis.
arXiv Detail & Related papers (2020-01-21T19:09:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.