Exploring language relations through syntactic distances and geographic proximity
- URL: http://arxiv.org/abs/2403.18430v2
- Date: Thu, 03 Oct 2024 08:24:40 GMT
- Title: Exploring language relations through syntactic distances and geographic proximity
- Authors: Juan De Gregorio, Raúl Toral, David Sánchez,
- Abstract summary: We explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset.
We find definite clusters that correspond to well known language families and groups, with exceptions explained by distinct morphological typologies.
- Score: 0.4369550829556578
- License:
- Abstract: Languages are grouped into families that share common linguistic traits. While this approach has been successful in understanding genetic relations between diverse languages, more analyses are needed to accurately quantify their relatedness, especially in less studied linguistic levels such as syntax. Here, we explore linguistic distances using series of parts of speech (POS) extracted from the Universal Dependencies dataset. Within an information-theoretic framework, we show that employing POS trigrams maximizes the possibility of capturing syntactic variations while being at the same time compatible with the amount of available data. Linguistic connections are then established by assessing pairwise distances based on the POS distributions. Intriguingly, our analysis reveals definite clusters that correspond to well known language families and groups, with exceptions explained by distinct morphological typologies. Furthermore, we obtain a significant correlation between language similarity and geographic distance, which underscores the influence of spatial proximity on language kinships.
Related papers
- Patterns of Persistence and Diffusibility across the World's Languages [3.7055269158186874]
Colexification is a type of similarity where a single lexical form is used to convey multiple meanings.
We shed light on the linguistic causes of cross-lingual similarity in colexification and phonology.
We construct large-scale graphs incorporating semantic, genealogical, phonological and geographical data for 1,966 languages.
arXiv Detail & Related papers (2024-01-03T12:05:38Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - A Massively Multilingual Analysis of Cross-linguality in Shared
Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - Linguistic dependencies and statistical dependence [76.89273585568084]
We use pretrained language models to estimate probabilities of words in context.
We find that maximum-CPMI trees correspond to linguistic dependencies more often than trees extracted from non-contextual PMI estimate.
arXiv Detail & Related papers (2021-04-18T02:43:37Z) - Probing Multilingual BERT for Genetic and Typological Signals [28.360662552057324]
We probe the layers in multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages.
We employ the language distances to infer and evaluate language trees, finding that they are close to the reference family tree in terms of quartet tree distance.
arXiv Detail & Related papers (2020-11-04T00:03:04Z) - Linguistic Typology Features from Text: Inferring the Sparse Features of
World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z) - On the coexistence of competing languages [0.0]
We revisit the question of language competition, with an emphasis on uncovering the ways in which coexistence might emerge.
We find that this emergence is related to symmetry breaking, and explore two particular scenarios.
For each of these, the investigation of paradigmatic situations leads us to a quantitative understanding of the conditions leading to language coexistence.
arXiv Detail & Related papers (2020-03-10T14:06:55Z) - The Secret is in the Spectra: Predicting Cross-lingual Task Performance
with Spectral Similarity Measures [83.53361353172261]
We present a large-scale study focused on the correlations between monolingual embedding space similarity and task performance.
We introduce several isomorphism measures between two embedding spaces, based on the relevant statistics of their individual spectra.
We empirically show that 1) language similarity scores derived from such spectral isomorphism measures are strongly associated with performance observed in different cross-lingual tasks.
arXiv Detail & Related papers (2020-01-30T00:09:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.