Multilingual Gradient Word-Order Typology from Universal Dependencies
- URL: http://arxiv.org/abs/2402.01513v1
- Date: Fri, 2 Feb 2024 15:54:19 GMT
- Title: Multilingual Gradient Word-Order Typology from Universal Dependencies
- Authors: Emi Baylor and Esther Ploeger and Johannes Bjerva
- Abstract summary: Existing typological databases, including WALS and Grambank, suffer from inconsistencies primarily caused by their categorical format.
We introduce a new seed dataset made up of continuous-valued data, rather than categorical data, that can better reflect the variability of language.
- Score: 2.968112652976397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While information from the field of linguistic typology has the potential to improve performance on NLP tasks, reliable typological data is a prerequisite. Existing typological databases, including WALS and Grambank, suffer from inconsistencies primarily caused by their categorical format. Furthermore, typological categorisations by definition differ significantly from the continuous nature of phenomena, as found in natural language corpora. In this paper, we introduce a new seed dataset made up of continuous-valued data, rather than categorical data, that can better reflect the variability of language. While this initial dataset focuses on word-order typology, we also present the methodology used to create the dataset, which can be easily adapted to generate data for a broader set of features and languages.
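To make the idea of continuous-valued word-order data concrete, below is a minimal sketch of how a gradient word-order feature can be computed from a Universal Dependencies treebank in CoNLL-U format. This is an illustration under assumptions, not the paper's exact pipeline: the choice of the `obj` relation, the function name, and the example file path are all hypothetical.

```python
# Minimal sketch: a gradient (continuous-valued) word-order feature
# derived from a Universal Dependencies treebank in CoNLL-U format.
# The relation ("obj") and file path are illustrative assumptions.

from collections import Counter

def object_order_proportion(conllu_path):
    """Return the proportion of direct objects that follow their verb:
    1.0 means strictly verb-object (VO), 0.0 strictly object-verb (OV)."""
    counts = Counter()
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip comments and sentence-separating blank lines.
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            if "-" in cols[0] or "." in cols[0]:
                continue
            idx, head, deprel = cols[0], cols[6], cols[7]
            if deprel == "obj":
                counts["VO" if int(idx) > int(head) else "OV"] += 1
    total = counts["VO"] + counts["OV"]
    return counts["VO"] / total if total else None

# Hypothetical usage with a UD English treebank file:
# print(object_order_proportion("en_ewt-ud-train.conllu"))
```

Each language thus receives a value between 0 and 1 rather than a single categorical VO/OV label, preserving corpus-level variability that categorical databases such as WALS and Grambank collapse.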
Related papers
- Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution [7.681258910515419]
Tabular data presents unique challenges due to its heterogeneous nature and complex structural relationships.
High predictive performance and robustness in tabular data analysis holds significant promise for numerous applications.
The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning.
arXiv Detail & Related papers (2024-08-20T04:59:19Z)
- The Past, Present, and Future of Typological Databases in NLP [2.968112652976397]
Typological information has the potential to be beneficial in the development of NLP models.
Current large-scale typological databases, notably WALS and Grambank, are inconsistent both with each other and with other sources of typological information.
We shed light on this issue by systematically exploring disagreements across typological databases and resources, and their uses in NLP.
arXiv Detail & Related papers (2023-10-20T12:01:42Z)
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias [92.41919689753051]
Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks.
We investigate training data generation with diversely attributed prompts, which have the potential to yield diverse and attributed generated data.
We show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
arXiv Detail & Related papers (2023-06-28T03:31:31Z)
- Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA is able to surpass traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
arXiv Detail & Related papers (2023-05-25T17:43:05Z)
- Language Embeddings Sometimes Contain Typological Generalizations [0.0]
We train neural models for a range of natural language processing tasks on a massively multilingual dataset of Bible translations in 1295 languages.
The learned language representations are then compared to existing typological databases as well as to a novel set of quantitative syntactic and morphological features.
We conclude that some generalizations are surprisingly close to traditional features from linguistic typology, but that most models, as well as those of previous work, do not appear to have made linguistically meaningful generalizations.
arXiv Detail & Related papers (2023-01-19T15:09:59Z)
- Automatically Identifying Semantic Bias in Crowdsourced Natural Language Inference Datasets [78.6856732729301]
We introduce a model-driven, unsupervised technique to find "bias clusters" in a learned embedding space of hypotheses in NLI datasets.
Interventions and additional rounds of labeling can then be performed to ameliorate the semantic bias of a dataset's hypothesis distribution.
arXiv Detail & Related papers (2021-12-16T22:49:01Z)
- Does Typological Blinding Impede Cross-Lingual Sharing? [31.20201199491578]
We show that a model trained in a cross-lingual setting will pick up on typological cues from the input data.
We investigate how cross-lingual sharing and performance are impacted.
arXiv Detail & Related papers (2021-01-28T09:32:08Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.