The Shape of Word Embeddings: Recognizing Language Phylogenies through Topological Data Analysis
- URL: http://arxiv.org/abs/2404.00500v1
- Date: Sat, 30 Mar 2024 23:51:25 GMT
- Authors: Ondřej Draganov, Steven Skiena
- Abstract summary: We use persistent homology from topological data analysis to measure the distances between language pairs from the shape of their unlabeled embeddings.
We construct language phylogenetic trees over 81 Indo-European languages.
- Score: 10.242373477945376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word embeddings represent language vocabularies as clouds of $d$-dimensional points. We investigate how information is conveyed by the general shape of these clouds, outside of representing the semantic meaning of each token. Specifically, we use the notion of persistent homology from topological data analysis (TDA) to measure the distances between language pairs from the shape of their unlabeled embeddings. We use these distance matrices to construct language phylogenetic trees over 81 Indo-European languages. Careful evaluation shows that our reconstructed trees exhibit strong similarities to the reference tree.
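The idea in the abstract, comparing languages by the shape of their unlabeled point clouds, can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses only 0-dimensional persistent homology, relying on the standard fact that the H0 death times of a Vietoris-Rips filtration equal the edge lengths of a Euclidean minimum spanning tree. The function names `h0_death_times`, `shape_distance`, and `distance_matrix`, the toy clouds, and the choice of comparing sorted death-time vectors with a Euclidean norm are all assumptions made for illustration.

```python
import math


def h0_death_times(points):
    """Sorted H0 death times of a Vietoris-Rips filtration.

    Computed as the edge lengths of a Euclidean minimum spanning tree
    (Prim's algorithm). Illustrative stand-in, not the paper's method.
    """
    n = len(points)
    if n < 2:
        return []
    # best[i] = shortest known distance from point i to the growing tree
    best = [math.dist(points[0], p) for p in points]
    in_tree = [False] * n
    in_tree[0] = True
    deaths = []
    for _ in range(n - 1):
        # Attach the closest point that is not yet in the tree.
        j = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        deaths.append(best[j])
        in_tree[j] = True
        for i in range(n):
            if not in_tree[i]:
                d = math.dist(points[j], points[i])
                if d < best[i]:
                    best[i] = d
    return sorted(deaths)


def shape_distance(cloud_a, cloud_b):
    """Euclidean distance between sorted H0 death-time vectors.

    Assumes both clouds have the same number of points (hypothetical
    simplification; real persistence diagrams need a proper matching).
    """
    da, db = h0_death_times(cloud_a), h0_death_times(cloud_b)
    if len(da) != len(db):
        raise ValueError("clouds must have the same number of points")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)))


def distance_matrix(clouds):
    """Pairwise shape distances for a dict of name -> point cloud."""
    names = list(clouds)
    matrix = [[shape_distance(clouds[a], clouds[b]) for b in names] for a in names]
    return names, matrix


# Toy "embeddings": a unit square, a translated copy, and a copy scaled by 2.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
shifted = [(5, 5), (6, 5), (5, 6), (6, 6)]
doubled = [(0, 0), (2, 0), (0, 2), (2, 2)]

names, matrix = distance_matrix({"a": square, "b": shifted, "c": doubled})
print(names)
for row in matrix:
    print([round(v, 3) for v in row])
```

Because the summary depends only on pairwise distances, the translated copy is at distance 0 from the original, while the scaled copy is at distance sqrt(3), about 1.732. A distance matrix of this kind could then be passed to any hierarchical clustering routine to obtain a tree, which is the general shape of the approach the abstract describes.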
Related papers
- UCxn: Typologically Informed Annotation of Constructions Atop Universal Dependencies [40.202120178465]
Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements are not labeled holistically.
We argue for augmenting UD annotations with a 'UCxn' annotation layer for such meaning-bearing grammatical constructions.
As a case study, we consider five construction families in ten languages, identifying instances of each construction in UD treebanks through the use of morphosyntactic patterns.
(2024-03-26T14:40:10Z)
- Domain Embeddings for Generating Complex Descriptions of Concepts in Italian Language [65.268245109828]
We propose a Distributional Semantic resource enriched with linguistic and lexical information extracted from electronic dictionaries.
The resource comprises 21 domain-specific matrices, one comprehensive matrix, and a Graphical User Interface.
Our model facilitates the generation of reasoned semantic descriptions of concepts by selecting matrices directly associated with concrete conceptual knowledge.
(2024-02-26T15:04:35Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
(2021-09-13T21:05:37Z)
- Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses [62.197912623223964]
We show a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings.
We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI.
This suggests that the embedding captures some part of the brain's natural language representation structure.
(2021-06-09T22:59:12Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
(2021-04-20T17:56:24Z)
- Probing Multilingual BERT for Genetic and Typological Signals [28.360662552057324]
We probe the layers in multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages.
We employ the language distances to infer and evaluate language trees, finding that they are close to the reference family tree in terms of quartet tree distance.
(2020-11-04T00:03:04Z)
- Bio-inspired Structure Identification in Language Embeddings [3.5292026405502215]
We present a series of explorations using bio-inspired methodology to traverse and visualize word embeddings.
We show that our model can be used to investigate how different word embedding techniques result in different semantic outputs.
(2020-09-05T04:44:15Z)
- Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
(2020-07-17T15:56:54Z)
- Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach [78.77265671634454]
We make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances".
Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.
(2020-05-12T15:35:00Z)
- Topological Data Analysis in Text Classification: Extracting Features with Additive Information [2.1410799064827226]
Topological Data Analysis is challenging to apply to high dimensional numeric data.
Topological features carry some exclusive information not captured by conventional text mining methods.
Adding topological features to the conventional features in ensemble models improves the classification results.
(2020-03-29T21:02:09Z)