Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures
- URL: http://arxiv.org/abs/2005.00100v2
- Date: Mon, 4 May 2020 20:53:11 GMT
- Title: Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures
- Authors: Alexander Gutkin, Tatiana Merkulova and Martin Jansche
- Abstract summary: We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
- Score: 73.06435180872293
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The use of linguistic typological resources in natural language processing has been steadily gaining popularity. It has been observed that the use of typological information, often combined with distributed language representations, leads to significantly more powerful models. While linguistic typology representations from various resources have mostly been used for conditioning the models, relatively little attention has been paid to predicting the features in these resources from the input data. In this paper we investigate whether the various linguistic features from the World Atlas of Language Structures (WALS) can be reliably inferred from multilingual text. Such a predictor can be used to infer structural features for a language never observed in the training data. We frame this task as multi-label classification: predicting a set of non-mutually exclusive and extremely sparse multi-valued labels (the WALS features). We construct a recurrent neural network predictor based on byte embeddings and convolutional layers, test its performance on 556 languages, and provide analysis for various linguistic types, macro-areas, language families and individual features. We show that some features from various linguistic types can be predicted reliably.
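The architecture described in the abstract can be pictured with a minimal sketch, assuming PyTorch and purely illustrative layer sizes; this is not the authors' implementation, only the general byte-embedding, convolution, recurrent-encoder pattern with one softmax head per sparse, multi-valued WALS feature:

```python
# Hypothetical sketch of a byte-embedding -> Conv1d -> GRU -> per-feature softmax
# predictor for WALS features; dimensions and feature inventory are illustrative.
import torch
import torch.nn as nn

class WalsPredictor(nn.Module):
    def __init__(self, num_values_per_feature, emb_dim=64, conv_channels=128, hidden=256):
        super().__init__()
        self.byte_emb = nn.Embedding(256, emb_dim)          # one embedding per byte value
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=5, padding=2)
        self.rnn = nn.GRU(conv_channels, hidden, batch_first=True, bidirectional=True)
        # one independent classification head per (sparse, multi-valued) WALS feature
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, n_vals) for n_vals in num_values_per_feature]
        )

    def forward(self, byte_ids):                             # byte_ids: (batch, seq_len)
        x = self.byte_emb(byte_ids)                          # (batch, seq_len, emb_dim)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)     # (batch, seq_len, conv_channels)
        _, h = self.rnn(torch.relu(x))                       # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=-1)                  # (batch, 2*hidden) text summary
        return [head(h) for head in self.heads]              # one logit vector per feature

# Example: three toy WALS features with 2, 6 and 3 possible values.
model = WalsPredictor(num_values_per_feature=[2, 6, 3])
text = "Ein Beispielsatz.".encode("utf-8")
batch = torch.tensor([list(text)])                           # raw bytes as token ids
logits = model(batch)
print([l.shape for l in logits])
```

In a setup like this, each head would only be trained on languages for which that WALS feature is attested, which is one simple way to cope with the extreme sparsity of the labels.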
Related papers
- Language Embeddings Sometimes Contain Typological Generalizations [0.0]
We train neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages.
The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features.
We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations.
arXiv Detail & Related papers (2023-01-19T15:09:59Z)
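As a toy illustration of how learned language representations can be compared against a typological database, one can fit a simple probing classifier from per-language vectors to a single categorical feature; the sketch below uses scikit-learn with random placeholder data and is not the paper's actual methodology:

```python
# Hypothetical probing sketch: predict one categorical typological feature
# from fixed per-language embedding vectors with a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_languages, dim = 200, 100
lang_vectors = rng.normal(size=(n_languages, dim))      # placeholder language embeddings
word_order = rng.integers(0, 3, size=n_languages)       # placeholder feature values (e.g. SOV/SVO/VSO)

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, lang_vectors, word_order, cv=5)
print("probe accuracy:", scores.mean())                 # near chance here, since the data is random
```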
- Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation [0.04199844472131922]
We present and apply a GUI-assisted framework that allows us to easily probe a massive number of languages.
Most of the regularities revealed in the mBERT model are typical of Western European languages.
Our framework can be integrated with the existing probing toolboxes, model cards, and leaderboards.
arXiv Detail & Related papers (2022-10-24T13:41:17Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this inductive bias from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- SIGTYP 2020 Shared Task: Prediction of Typological Features [78.95376120154083]
A major drawback hampering broader adoption of typological KBs is that they are sparsely populated.
As typological features often correlate with one another, it is possible to predict them and thus automatically populate typological KBs.
Overall, the task attracted 8 submissions from 5 teams, and the most successful methods make use of such feature correlations.
arXiv Detail & Related papers (2020-10-16T08:47:24Z)
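A minimal sketch of the feature-correlation idea behind the shared task, with a made-up five-language "database" (not any team's actual system): estimate how one WALS-style feature co-occurs with another across languages where both are attested, then impute the missing value for a new language from those conditional counts:

```python
# Hypothetical sketch: impute a missing typological feature from a correlated one
# using conditional frequencies; the tiny "database" below is made up.
from collections import Counter, defaultdict

# (language, order of subject/object/verb, order of adposition/noun)
known = [
    ("lang_a", "SOV", "Postpositions"),
    ("lang_b", "SOV", "Postpositions"),
    ("lang_c", "SVO", "Prepositions"),
    ("lang_d", "SVO", "Prepositions"),
    ("lang_e", "SOV", "Prepositions"),
]

# Count how adposition order co-occurs with each word-order value.
cooc = defaultdict(Counter)
for _, word_order, adposition in known:
    cooc[word_order][adposition] += 1

def impute_adposition(word_order):
    """Predict the adposition order most frequently seen with this word order."""
    counts = cooc.get(word_order)
    return counts.most_common(1)[0][0] if counts else None

print(impute_adposition("SOV"))   # -> "Postpositions" (2 of the 3 SOV languages)
```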
- NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task [83.43738174234053]
We employ frequentist inference to represent correlations between typological features and use this representation to train simple multi-class estimators that predict individual features.
Our best configuration achieved a micro-averaged accuracy of 0.66 on 149 test languages.
arXiv Detail & Related papers (2020-10-12T19:25:43Z)
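For reference, the micro-averaged accuracy reported above pools every (language, feature) decision into a single ratio; a short sketch with made-up gold and predicted values:

```python
# Micro-averaged accuracy: pool all (language, feature) decisions into one ratio.
# The gold/predicted dictionaries below are made-up toy data.
gold = {
    "lang_a": {"81A": "SOV", "85A": "Postpositions"},
    "lang_b": {"81A": "SVO"},
}
pred = {
    "lang_a": {"81A": "SOV", "85A": "Prepositions"},
    "lang_b": {"81A": "SVO"},
}

correct = total = 0
for lang, feats in gold.items():
    for feat, value in feats.items():
        total += 1
        correct += int(pred.get(lang, {}).get(feat) == value)

print(correct / total)   # 2 correct out of 3 predictions, roughly 0.67
```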
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
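A rough sketch of an SVCCA-style analysis as mentioned above, assuming NumPy and scikit-learn and using random placeholder vectors (not the paper's code): reduce each view of the per-language representations with an SVD, then measure canonical correlations between the two reduced views:

```python
# Hypothetical SVCCA-style sketch: SVD-reduce two views of language vectors,
# then compute canonical correlations between the reduced views.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_langs = 150
view_kb = rng.normal(size=(n_langs, 50))      # placeholder: KB-derived language vectors
view_nmt = rng.normal(size=(n_langs, 512))    # placeholder: NMT-learned language vectors

def svd_reduce(x, k):
    """Keep the top-k left singular directions (the 'SV' step of SVCCA)."""
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]

a, b = svd_reduce(view_kb, 20), svd_reduce(view_nmt, 20)
cca = CCA(n_components=10).fit(a, b)
a_c, b_c = cca.transform(a, b)
corrs = [np.corrcoef(a_c[:, i], b_c[:, i])[0, 1] for i in range(a_c.shape[1])]
print("mean canonical correlation:", float(np.mean(corrs)))
```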
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages jointly.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
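In the spirit of the entry above, here is a compact, hypothetical sketch of a character-level recurrent tagger whose character embeddings are shared across languages; it is simplified to score a tag at every character position rather than per word, with toy inventory sizes:

```python
# Hypothetical character-level tagger with character embeddings shared across
# languages; sizes and the language inventory are illustrative.
import torch
import torch.nn as nn

class CharTagger(nn.Module):
    def __init__(self, n_chars=256, n_tags=40, n_langs=10, emb=32, hidden=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb)        # shared across all languages
        self.lang_emb = nn.Embedding(n_langs, emb)        # tells the model which language it reads
        self.rnn = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)          # per-position morphological tag scores

    def forward(self, char_ids, lang_id):                 # char_ids: (batch, seq_len)
        x = self.char_emb(char_ids) + self.lang_emb(lang_id).unsqueeze(1)
        h, _ = self.rnn(x)
        return self.out(h)                                # (batch, seq_len, n_tags)

model = CharTagger()
chars = torch.tensor([list("talot".encode("utf-8"))])     # bytes as character ids
print(model(chars, torch.tensor([3])).shape)              # torch.Size([1, 5, 40])
```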
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.