In search of isoglosses: continuous and discrete language embeddings in
Slavic historical phonology
- URL: http://arxiv.org/abs/2005.13575v1
- Date: Wed, 27 May 2020 18:10:46 GMT
- Title: In search of isoglosses: continuous and discrete language embeddings in
Slavic historical phonology
- Authors: Chundra A. Cathcart, Florian Wandl
- Abstract summary: We employ three different types of language embedding (dense, sigmoid, and straight-through).
We find that the Straight-Through model outperforms the other two in terms of accuracy, but the Sigmoid model's language embeddings show the strongest agreement with the traditional subgrouping of the Slavic languages.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the ability of neural network architectures to
effectively learn diachronic phonological generalizations in a multilingual
setting. We employ models using three different types of language embedding
(dense, sigmoid, and straight-through). We find that the Straight-Through model
outperforms the other two in terms of accuracy, but the Sigmoid model's
language embeddings show the strongest agreement with the traditional
subgrouping of the Slavic languages. We find that the Straight-Through model
has learned coherent, semi-interpretable information about sound change, and
outline directions for future research.
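To make the three embedding types concrete: a dense embedding is an unconstrained real-valued vector per language, a sigmoid embedding squashes each dimension into (0, 1), and a straight-through embedding binarizes those values on the forward pass while gradients flow through the underlying sigmoid. A minimal PyTorch sketch of the idea (the module and all names in it are our illustration, not the authors' released code):

```python
import torch
import torch.nn as nn

class LanguageEmbedding(nn.Module):
    """Per-language vector conditioning a multilingual sound-change model.

    kind: "dense"            -- unconstrained real-valued vector
          "sigmoid"          -- each dimension squashed into (0, 1)
          "straight-through" -- hard 0/1 forward pass, sigmoid gradient
    """

    def __init__(self, n_languages: int, dim: int, kind: str = "dense"):
        super().__init__()
        self.kind = kind
        self.weight = nn.Parameter(0.1 * torch.randn(n_languages, dim))

    def forward(self, lang_ids: torch.Tensor) -> torch.Tensor:
        e = self.weight[lang_ids]
        if self.kind == "dense":
            return e
        p = torch.sigmoid(e)
        if self.kind == "sigmoid":
            return p
        # Straight-through estimator: binarize on the forward pass, but
        # let gradients flow as if the output were sigmoid(e).
        hard = (p > 0.5).float()
        return hard + p - p.detach()

# Example: binary feature vectors for three language indices.
emb = LanguageEmbedding(n_languages=13, dim=32, kind="straight-through")
print(emb(torch.tensor([0, 1, 2])))  # (3, 32) tensor with entries in {0, 1}
```

The appeal of the binary variant for historical phonology is that each dimension behaves like a discrete feature a language either has or lacks, i.e. something isogloss-like, which fits the paper's finding that the Straight-Through model learns coherent, semi-interpretable information about sound change.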
Related papers
- A Comparative Analysis of Bilingual and Trilingual Wav2Vec Models for Automatic Speech Recognition in Multilingual Oral History Archives [2.3592914313389257]
We compare monolingual Wav2Vec 2.0 models with various multilingual models to see whether speech recognition performance can be improved.
Our results suggest that monolingual speech recognition models are, in most cases, superior to multilingual models.
arXiv Detail & Related papers (2024-07-24T11:03:47Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones, but that monolingual models tend to reach better performance when the amount of data is held fixed.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Language Embeddings Sometimes Contain Typological Generalizations [0.0]
We train neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages.
The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features.
We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations.
arXiv Detail & Related papers (2023-01-19T15:09:59Z)
- Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models [11.439430077017635]
We find that pre-trained speech models encode language-discriminatory information most strongly in their lower layers.
We demonstrate that the embeddings obtained from these layers are robust enough to classify unseen languages.
We open-source the model through the NVIDIA NeMo toolkit.
arXiv Detail & Related papers (2022-11-09T18:53:59Z)
- Integrating Linguistic Theory and Neural Language Models [2.870517198186329]
I present several case studies to illustrate how theoretical linguistics and neural language models are still relevant to each other.
This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models.
arXiv Detail & Related papers (2022-07-20T04:20:46Z)
- Is neural language acquisition similar to natural? A chronological probing study [0.0515648410037406]
We present a chronological probing study of English transformer models such as MultiBERT and T5.
We compare the linguistic information the models acquire over the course of training on their corpora.
The results show that (1) linguistic information is acquired in the early stages of training, and (2) both language models demonstrate the capability to capture features from various levels of language.
arXiv Detail & Related papers (2022-07-01T17:24:11Z)
- Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models [84.86942006830772]
We conjecture that multilingual pre-trained models can derive language-universal abstractions about grammar.
We conduct the first large-scale empirical study over 43 languages and 14 morphosyntactic categories with a state-of-the-art neuron-level probe.
arXiv Detail & Related papers (2022-05-04T12:22:31Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
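The grouping step described in this entry can be pictured as ordinary clustering over per-language vectors. A toy sketch, assuming mean-pooled per-language representations and scikit-learn's KMeans (the language sample, dimensionality, and cluster count below are made-up placeholders, not the paper's pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for language representations extracted from a
# multilingual pre-trained model (in practice: pooled hidden states
# per language, with hundreds of dimensions).
rng = np.random.default_rng(0)
langs = ["ru", "pl", "cs", "de", "nl", "es", "it", "ro"]
reps = rng.normal(size=(len(langs), 16))

# Cluster the languages into "representation sprachbunds".
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(reps)
for lang, group in zip(langs, kmeans.labels_):
    print(f"{lang} -> sprachbund {group}")
```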
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
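Read as a sketch, the setup is a shared LSTM encoder whose hidden states serve as the contextualised embeddings, with one decoder translating and another reconstructing the input. A minimal, teacher-forced rendering under those assumptions (all dimensions and names are ours, and attention is omitted for brevity):

```python
import torch
import torch.nn as nn

class DualDecoderSeq2Seq(nn.Module):
    """Sketch: one LSTM encoder, two LSTM decoders (translate + reconstruct)."""

    def __init__(self, v_src: int, v_tgt: int, d: int = 128):
        super().__init__()
        self.emb_src = nn.Embedding(v_src, d)
        self.emb_tgt = nn.Embedding(v_tgt, d)
        self.enc = nn.LSTM(d, d, batch_first=True)
        self.dec_tr = nn.LSTM(d, d, batch_first=True)  # translation decoder
        self.dec_rc = nn.LSTM(d, d, batch_first=True)  # reconstruction decoder
        self.head_tr = nn.Linear(d, v_tgt)
        self.head_rc = nn.Linear(d, v_src)

    def forward(self, src, tgt_in, src_in):
        # Encoder hidden states double as contextualised word embeddings.
        ctx, state = self.enc(self.emb_src(src))
        tr, _ = self.dec_tr(self.emb_tgt(tgt_in), state)  # teacher forcing
        rc, _ = self.dec_rc(self.emb_src(src_in), state)
        return self.head_tr(tr), self.head_rc(rc), ctx

model = DualDecoderSeq2Seq(v_src=1000, v_tgt=1200)
src = torch.randint(0, 1000, (2, 7))
tr_logits, rc_logits, ctx = model(src, torch.randint(0, 1200, (2, 9)), src)
print(ctx.shape)  # (2, 7, 128): one contextualised embedding per source token
```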
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
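Taken literally, the described predictor maps raw bytes through an embedding layer, a convolution, and a recurrent layer to one categorical prediction per typological feature. A self-contained sketch under those assumptions (all layer sizes and the 10-feature-by-7-value output space are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class TypologyPredictor(nn.Module):
    """Sketch: byte-level text encoder predicting categorical WALS-style features."""

    def __init__(self, n_features: int, n_values: int):
        super().__init__()
        self.n_features, self.n_values = n_features, n_values
        self.embed = nn.Embedding(256, 64)  # one embedding per possible byte
        self.conv = nn.Conv1d(64, 128, kernel_size=5, padding=2)
        self.rnn = nn.GRU(128, 128, batch_first=True)
        self.out = nn.Linear(128, n_features * n_values)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(byte_ids)                 # (B, T, 64)
        x = self.conv(x.transpose(1, 2)).relu()  # (B, 128, T)
        _, h = self.rnn(x.transpose(1, 2))       # final hidden state: (1, B, 128)
        logits = self.out(h[-1])                 # (B, n_features * n_values)
        return logits.view(-1, self.n_features, self.n_values)

# Raw bytes of a sentence in, one prediction per typological feature out.
model = TypologyPredictor(n_features=10, n_values=7)
text = torch.tensor([list(b"jazyk je system znakov")])  # (1, T) byte ids
print(model(text).shape)  # torch.Size([1, 10, 7])
```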
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.