The Past, Present, and Future of Typological Databases in NLP
- URL: http://arxiv.org/abs/2310.13440v1
- Date: Fri, 20 Oct 2023 12:01:42 GMT
- Title: The Past, Present, and Future of Typological Databases in NLP
- Authors: Emi Baylor and Esther Ploeger and Johannes Bjerva
- Abstract summary: Typological information has the potential to be beneficial in the development of NLP models.
Current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information.
We shed light on this issue by systematically exploring disagreements across typological databases and resources, and their uses in NLP.
- Score: 2.968112652976397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Typological information has the potential to be beneficial in the development
of NLP models, particularly for low-resource languages. Unfortunately, current
large-scale typological databases, notably WALS and Grambank, are inconsistent
both with each other and with other sources of typological information, such as
linguistic grammars. Some of these inconsistencies stem from coding errors or
linguistic variation, but many of the disagreements are due to the discrete
categorical nature of these databases. We shed light on this issue by
systematically exploring disagreements across typological databases and
resources, and their uses in NLP, covering the past and present. We next
investigate the future of such work, offering an argument that a continuous
view of typological features is clearly beneficial, echoing recommendations
from linguistics. We propose that such a view of typology has significant
potential in the future, including in language modeling in low-resource
scenarios.
Related papers
- Multilingual Gradient Word-Order Typology from Universal Dependencies [2.968112652976397]
Existing typological databases, including WALS and Grambank, suffer from inconsistencies primarily caused by their categorical format.
We introduce a new seed dataset made up of continuous-valued data, rather than categorical data, that can better reflect the variability of language.
arXiv Detail & Related papers (2024-02-02T15:54:19Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Language Embeddings Sometimes Contain Typological Generalizations [0.0]
We train neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages.
The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features.
We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations.
arXiv Detail & Related papers (2023-01-19T15:09:59Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Does Typological Blinding Impede Cross-Lingual Sharing? [31.20201199491578]
We show that a model trained in a cross-lingual setting will pick up on typological cues from the input data.
We investigate how cross-lingual sharing and performance are impacted.
arXiv Detail & Related papers (2021-01-28T09:32:08Z)
- SIGTYP 2020 Shared Task: Prediction of Typological Features [78.95376120154083]
A major drawback hampering broader adoption of typological KBs is that they are sparsely populated.
As typological features often correlate with one another, it is possible to predict them and thus automatically populate typological KBs.
Overall, the task attracted 8 submissions from 5 teams, out of which the most successful methods make use of such feature correlations.
arXiv Detail & Related papers (2020-10-16T08:47:24Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological taggings for high-resource languages and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)