The Shape of Word Embeddings: Quantifying Non-Isometry With Topological Data Analysis
- URL: http://arxiv.org/abs/2404.00500v2
- Date: Mon, 11 Nov 2024 09:51:33 GMT
- Title: The Shape of Word Embeddings: Quantifying Non-Isometry With Topological Data Analysis
- Authors: Ondřej Draganov, Steven Skiena
- Abstract summary: We use persistent homology from topological data analysis to measure the distances between language pairs from the shape of their unlabeled embeddings.
To distinguish whether these differences are random training errors or capture real information about the languages, we use the computed distance matrices to construct language phylogenetic trees over 81 Indo-European languages.
- Abstract: Word embeddings represent language vocabularies as clouds of $d$-dimensional points. We investigate how information is conveyed by the general shape of these clouds, rather than by the semantic meaning of each individual token. Specifically, we use the notion of persistent homology from topological data analysis (TDA) to measure the distances between language pairs from the shape of their unlabeled embeddings. These distances quantify the degree of non-isometry of the embeddings. To distinguish whether these differences are random training errors or capture real information about the languages, we use the computed distance matrices to construct language phylogenetic trees over 81 Indo-European languages. Careful evaluation shows that our reconstructed trees exhibit strong and statistically significant similarities to the reference.
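For intuition, here is a minimal sketch of the pipeline the abstract describes. The libraries (ripser, persim, scipy), the choice of degree-1 bottleneck distance, and average-linkage clustering in place of the paper's phylogenetic reconstruction are all illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: compare the "shape" of unlabeled word-embedding clouds
# via persistent homology, then build a tree from the pairwise distances.
import numpy as np
from ripser import ripser                      # Vietoris-Rips persistence
from persim import bottleneck                  # distance between diagrams
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def h1_diagram(cloud: np.ndarray) -> np.ndarray:
    """Degree-1 persistence diagram (loops) of a point cloud."""
    return ripser(cloud, maxdim=1)["dgms"][1]

def shape_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Label-free distance between two embedding clouds; isometric clouds
    yield (near-)identical diagrams, so this quantifies non-isometry."""
    return bottleneck(h1_diagram(emb_a), h1_diagram(emb_b))

# Toy stand-ins for per-language embedding matrices (vocab_size x d).
rng = np.random.default_rng(0)
embeddings = {lang: rng.normal(size=(100, 20)) for lang in ("en", "de", "fr")}

langs = sorted(embeddings)
dist = np.zeros((len(langs), len(langs)))
for i, a in enumerate(langs):
    for j, b in enumerate(langs[i + 1:], start=i + 1):
        dist[i, j] = dist[j, i] = shape_distance(embeddings[a], embeddings[b])

# Hierarchical clustering as a simple stand-in for phylogenetic tree
# construction from the computed distance matrix.
tree = linkage(squareform(dist), method="average")
```

In the paper, such pairwise shape distances over 81 Indo-European languages feed the phylogenetic tree construction that is then evaluated against a reference tree.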
Related papers
- Geometric Signatures of Compositionality Across a Language Model's Lifetime [47.25475802128033]
We show that compositionality is reflected in representations' intrinsic dimensionality (a standard estimator is sketched below).
We also show that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training.
arXiv Detail & Related papers (2024-10-02T11:54:06Z)
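For reference, one common intrinsic-dimension estimator is TwoNN (Facco et al., 2017); the sketch below is a standard formulation of it, not necessarily the estimator used in that paper.

```python
# TwoNN intrinsic-dimension estimate from ratios of second- to
# first-nearest-neighbour distances: d = N / sum(log(r2 / r1)).
import numpy as np
from scipy.spatial.distance import cdist

def twonn_intrinsic_dimension(points: np.ndarray) -> float:
    d = cdist(points, points)
    np.fill_diagonal(d, np.inf)            # exclude self-distances
    r = np.sort(d, axis=1)[:, :2]          # first and second neighbours
    mu = r[:, 1] / r[:, 0]
    return len(points) / np.log(mu).sum()
```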
- Evaluating Contextualized Representations of (Spanish) Ambiguous Words: A New Lexical Resource and Empirical Analysis [2.2530496464901106]
We evaluate semantic representations of Spanish ambiguous nouns in context in a suite of Spanish-language monolingual and multilingual BERT-based models.
We find that various BERT-based LMs' contextualized semantic representations capture some variance in human judgments but fall short of the human benchmark.
arXiv Detail & Related papers (2024-06-20T18:58:11Z)
- How well do distributed representations convey contextual lexical semantics: a Thesis Proposal [3.3585951129432323]
In this thesis, we examine the efficacy of distributed representations from modern neural networks in encoding lexical meaning.
We identify four sources of ambiguity based on the relatedness and similarity of meanings influenced by context.
We then aim to evaluate these sources by collecting or constructing multilingual datasets, leveraging various language models, and employing linguistic analysis tools.
arXiv Detail & Related papers (2024-06-02T14:08:51Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance (a minimal sketch follows below).
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
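For concreteness, a minimal sketch of a bitext-retrieval alignment score, assuming sentence embeddings for parallel source/target text are already available; the encoder and scoring details are assumptions, not that paper's exact protocol.

```python
import numpy as np

def bitext_retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target sentence (by
    cosine similarity) is the aligned translation (same row index)."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float((nearest == np.arange(len(src))).mean())
```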
- Discrete representations in neural models of spoken language [56.29049879393466]
We compare the merits of four commonly used metrics in the context of weakly supervised models of spoken language.
We find that the different evaluation metrics can give inconsistent results.
arXiv Detail & Related papers (2021-05-12T11:02:02Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Linguistic dependencies and statistical dependence [76.89273585568084]
We use pretrained language models to estimate probabilities of words in context.
We find that maximum-CPMI trees correspond to linguistic dependencies more often than trees extracted from non-contextual PMI estimates (the tree-extraction step is sketched below).
arXiv Detail & Related papers (2021-04-18T02:43:37Z)
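For illustration, a sketch of the tree-extraction step, assuming a matrix of pairwise CPMI scores for one sentence has already been computed from a pretrained language model; the symmetrization and the spanning-tree reduction are implementation choices, not necessarily that paper's.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def max_cpmi_tree(cpmi: np.ndarray) -> list:
    """Edges (word-index pairs) of the tree maximizing total CPMI."""
    sym = (cpmi + cpmi.T) / 2.0               # make the graph undirected
    costs = (sym.max() + 1.0) - sym           # positive costs: a minimum
    np.fill_diagonal(costs, 0.0)              # tree on costs is a maximum
    mst = minimum_spanning_tree(costs)        # tree on the CPMI scores
    rows, cols = mst.nonzero()
    return list(zip(rows.tolist(), cols.tolist()))

edges = max_cpmi_tree(np.random.rand(6, 6))   # 5 edges over 6 "words"
```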
- Probing Multilingual BERT for Genetic and Typological Signals [28.360662552057324]
We probe the layers in multilingual BERT (mBERT) for phylogenetic and geographic language signals across 100 languages.
We employ the language distances to infer and evaluate language trees, finding that they are close to the reference family tree in terms of quartet tree distance.
arXiv Detail & Related papers (2020-11-04T00:03:04Z)
- Bio-inspired Structure Identification in Language Embeddings [3.5292026405502215]
We present a series of explorations using bio-inspired methodology to traverse and visualize word embeddings.
We show that our model can be used to investigate how different word embedding techniques result in different semantic outputs.
arXiv Detail & Related papers (2020-09-05T04:44:15Z)
- Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach [78.77265671634454]
We make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances" (a toy sketch of this objective follows below).
Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.
arXiv Detail & Related papers (2020-05-12T15:35:00Z)
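A toy sketch of such a multi-task objective, with an invented architecture and sizes; it shows the joint loss shape only, not that paper's model.

```python
import torch
import torch.nn as nn

class SyntacticDistanceLM(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.word_head = nn.Linear(dim, vocab)    # language modelling
        self.dist_head = nn.Linear(dim, 1)        # syntactic distance

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.word_head(h), self.dist_head(h).squeeze(-1)

model = SyntacticDistanceLM()
tokens = torch.randint(0, 1000, (2, 8))           # toy token batch
gold_dist = torch.rand(2, 8)                      # toy gold distances
logits, dist = model(tokens)
loss = (nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 1000),
                                    tokens[:, 1:].reshape(-1))
        + nn.functional.mse_loss(dist, gold_dist))
loss.backward()
```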
- Topological Data Analysis in Text Classification: Extracting Features with Additive Information [2.1410799064827226]
Topological Data Analysis is challenging to apply to high-dimensional numeric data.
Topological features carry some exclusive information not captured by conventional text mining methods.
Adding topological features to the conventional features in ensemble models improves the classification results (a sketch of the combination follows below).
arXiv Detail & Related papers (2020-03-29T21:02:09Z)
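A hedged sketch of that combination, with simple persistence summaries standing in for the paper's exact topological features; the feature and model choices are illustrative only.

```python
import numpy as np
from ripser import ripser
from sklearn.ensemble import RandomForestClassifier

def topological_features(word_vectors: np.ndarray) -> np.ndarray:
    """Count, total lifetime, and max lifetime of finite H0/H1 features."""
    feats = []
    for dgm in ripser(word_vectors, maxdim=1)["dgms"]:
        finite = dgm[np.isfinite(dgm[:, 1])]
        life = finite[:, 1] - finite[:, 0]
        feats += [len(life), float(life.sum()), float(life.max(initial=0.0))]
    return np.array(feats)

# Toy data: 10 documents, each a cloud of 30 word vectors in 20 dimensions;
# conventional features (e.g. TF-IDF) are stubbed with random values.
rng = np.random.default_rng(0)
clouds = [rng.normal(size=(30, 20)) for _ in range(10)]
X = np.hstack([rng.random((10, 5)),
               np.vstack([topological_features(c) for c in clouds])])
clf = RandomForestClassifier(n_estimators=50).fit(X, rng.integers(0, 2, 10))
```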