Global Syntactic Variation in Seven Languages: Towards a Computational
Dialectology
- URL: http://arxiv.org/abs/2104.01306v1
- Date: Sat, 3 Apr 2021 03:40:21 GMT
- Title: Global Syntactic Variation in Seven Languages: Towards a Computational
Dialectology
- Authors: Jonathan Dunn
- Abstract summary: We use Computational Construction Grammar to provide a replicable and falsifiable set of syntactic features.
We use global language mapping based on web-crawled and social media datasets to determine the selection of national varieties.
Results show that models for each language are able to robustly predict the region-of-origin of held-out samples better using Construction Grammars.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of this paper is to provide a complete representation of regional
linguistic variation on a global scale. To this end, the paper focuses on
removing three constraints that have previously limited work within
dialectology/dialectometry. First, rather than assuming a fixed and incomplete
set of variants, we use Computational Construction Grammar to provide a
replicable and falsifiable set of syntactic features. Second, rather than
assuming a specific area of interest, we use global language mapping based on
web-crawled and social media datasets to determine the selection of national
varieties. Third, rather than looking at a single language in isolation, we
model seven major languages together using the same methods: Arabic, English,
French, German, Portuguese, Russian, and Spanish. Results show that models for
each language are able to robustly predict the region-of-origin of held-out
samples better using Construction Grammars than using simpler syntactic
features. These global-scale experiments are used to argue that new methods in
computational sociolinguistics are able to provide more generalized models of
regional variation that are essential for understanding language variation and
change at scale.
Related papers
- Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement [1.4335183427838039]
We take the approach of developing curated synthetic data on a large scale, with specific properties.
We use a new multiple-choice task and datasets, Blackbird Language Matrices, to focus on a specific grammatical structural phenomenon.
We show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences.
arXiv Detail & Related papers (2024-09-10T14:58:55Z) - Are Structural Concepts Universal in Transformer Language Models?
Towards Interpretable Cross-Lingual Generalization [27.368684663279463]
We investigate the potential for explicitly aligning conceptual correspondence between languages to enhance cross-lingual generalization.
Using the syntactic aspect of language as a testbed, our analyses of 43 languages reveal a high degree of alignability.
We propose a meta-learning-based method to learn to align conceptual spaces of different languages.
arXiv Detail & Related papers (2023-10-19T14:50:51Z) - Universal and Independent: Multilingual Probing Framework for Exhaustive
Model Interpretation and Evaluation [0.04199844472131922]
We present and apply the GUI-assisted framework allowing us to easily probe a massive number of languages.
Most of the regularities revealed in the mBERT model are typical for the western-European languages.
Our framework can be integrated with the existing probing toolboxes, model cards, and leaderboards.
arXiv Detail & Related papers (2022-10-24T13:41:17Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z) - Cross-lingual Spoken Language Understanding with Regularized
Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - Linguistic Typology Features from Text: Inferring the Sparse Features of
World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.