Are All Good Word Vector Spaces Isomorphic?
- URL: http://arxiv.org/abs/2004.04070v2
- Date: Tue, 20 Oct 2020 17:22:02 GMT
- Title: Are All Good Word Vector Spaces Isomorphic?
- Authors: Ivan Vulić, Sebastian Ruder, and Anders Søgaard
- Abstract summary: We show that variance in performance across language pairs is not only due to typological differences, but can mostly be attributed to the size of the monolingual resources available.
- Score: 79.04509759167952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing algorithms for aligning cross-lingual word vector spaces assume that
vector spaces are approximately isomorphic. As a result, they perform poorly or
fail completely on non-isomorphic spaces. Such non-isomorphism has been
hypothesised to result from typological differences between languages. In this
work, we ask whether non-isomorphism is also crucially a sign of degenerate
word vector spaces. We present a series of experiments across diverse languages
which show that variance in performance across language pairs is not only due
to typological differences, but can mostly be attributed to the size of the
monolingual resources available, and to the properties and duration of
monolingual training (e.g. "under-training").
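The alignment algorithms the abstract refers to are typically mapping-based. As a concrete reference point, here is a minimal sketch (with illustrative toy data) of supervised orthogonal alignment via Procrustes, the setup whose isomorphism assumption the paper probes; it recovers the target space exactly only because the toy spaces are isomorphic by construction.

```python
# Minimal sketch of supervised orthogonal alignment (Procrustes), the
# mapping-based setup whose isomorphism assumption the paper examines.
# X, Y and the "seed dictionary" below are illustrative placeholders.
import numpy as np

def procrustes_align(X_src: np.ndarray, Y_tgt: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimising ||X_src @ W - Y_tgt||_F.

    X_src, Y_tgt: (n_pairs, dim) embeddings of a seed dictionary, where
    row i of X_src translates to row i of Y_tgt.
    """
    # Closed-form solution via SVD of the cross-covariance matrix.
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Toy usage: align a rotated copy of a random "source" space.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
R_true, _ = np.linalg.qr(rng.normal(size=(300, 300)))  # hidden rotation
Y = X @ R_true                                         # perfectly isomorphic target
W = procrustes_align(X[:500], Y[:500])                 # fit on a "seed dictionary"
# Succeeds only because the two spaces are isomorphic in this toy setting.
print(np.allclose(X @ W, Y, atol=1e-6))
```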
Related papers
- GRI: Graph-based Relative Isomorphism of Word Embedding Spaces [10.984134369344117]
Automated construction of bilingual dictionaries using monolingual embedding spaces is a core challenge in machine translation.
Existing attempts aimed at controlling the relative isomorphism of different spaces fail to incorporate the impact of semantically related words in the training objective.
We propose GRI, which combines distributional training objectives with attentive graph convolutions to jointly consider the impact of semantically similar words.
arXiv Detail & Related papers (2023-10-18T22:10:47Z)
- Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity [64.18762301574954]
Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings.
This seems to hold for both monolingual and multilingual models, although much less work has been done in the multilingual setting.
We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models.
arXiv Detail & Related papers (2023-06-01T09:01:48Z)
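For context on the two diagnostics mentioned above, here is a small illustrative sketch (not the paper's exact protocol): anisotropy estimated as the mean cosine similarity of randomly sampled representation pairs, and outlier dimensions as coordinates with unusually large average magnitude.

```python
# Sketch of two common diagnostics (illustrative, not the paper's procedure):
# anisotropy as the expected cosine similarity of random representation pairs,
# and outlier dimensions as coordinates with unusually large mean magnitude.
import numpy as np

def anisotropy(emb: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(emb), n_pairs)
    j = rng.integers(0, len(emb), n_pairs)
    a, b = emb[i], emb[j]
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(cos.mean())   # ~0 for isotropic vectors, approaches 1 when highly anisotropic

def outlier_dimensions(emb: np.ndarray, n_sigma: float = 3.0) -> np.ndarray:
    mags = np.abs(emb).mean(axis=0)            # mean |activation| per dimension
    return np.where(mags > mags.mean() + n_sigma * mags.std())[0]

# Toy usage: random vectors plus one artificially inflated dimension.
E = np.random.default_rng(1).normal(size=(5000, 768))
E[:, 42] += 8.0                                # simulate an outlier dimension
print(anisotropy(E), outlier_dimensions(E))    # anisotropy rises above ~0, dim 42 flagged
```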
- Lexinvariant Language Models [84.2829117441298]
Token embeddings, a mapping from discrete lexical symbols to continuous vectors, are at the heart of any language model (LM).
We study lexinvariant language models that are invariant to lexical symbols and therefore do not need fixed token embeddings in practice.
We show that a lexinvariant LM can attain perplexity comparable to that of a standard language model, given a sufficiently long context.
arXiv Detail & Related papers (2023-05-24T19:10:46Z)
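The summary states the property but not the construction. One way such invariance can be realised, shown here as an illustrative assumption rather than this paper's verified implementation, is to re-draw random embeddings for every symbol on each sequence, so the model must infer token identity from context alone.

```python
# Hypothetical sketch of lexinvariance: instead of a learned, fixed embedding
# table, each sequence gets freshly drawn random embeddings per symbol, so the
# model can only rely on in-context co-occurrence patterns. Illustrative
# assumption, not the paper's verified implementation.
import torch

def lexinvariant_embed(token_ids: torch.Tensor, vocab_size: int, dim: int) -> torch.Tensor:
    """token_ids: (batch, seq_len) -> (batch, seq_len, dim) embeddings that are
    re-randomised independently for every sequence in the batch."""
    batch, _ = token_ids.shape
    # One independent random embedding table per sequence.
    tables = torch.randn(batch, vocab_size, dim) / dim ** 0.5
    return torch.gather(tables, 1, token_ids.unsqueeze(-1).expand(-1, -1, dim))

# Repeated symbols within a sequence share a vector, but the same symbol gets a
# different vector in another sequence (and in another forward pass).
ids = torch.tensor([[5, 7, 5], [5, 7, 5]])
emb = lexinvariant_embed(ids, vocab_size=10, dim=4)
print(torch.allclose(emb[0, 0], emb[0, 2]), torch.allclose(emb[0, 0], emb[1, 0]))  # True, False
```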
- IsoVec: Controlling the Relative Isomorphism of Word Embedding Spaces [24.256732557154486]
We address the root cause of faulty cross-lingual mapping: that word embedding training resulted in the underlying spaces being non-isomorphic.
We incorporate global measures of isomorphism directly into the Skip-gram loss function, successfully increasing the relative isomorphism of trained word embedding spaces.
arXiv Detail & Related papers (2022-10-11T02:29:34Z)
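The summary mentions global measures of isomorphism without naming one. A common choice in this literature is eigenvector similarity over nearest-neighbour graph Laplacians, sketched below as an illustration (not necessarily the exact measure folded into the IsoVec objective).

```python
# Sketch of one commonly used *global* isomorphism measure, eigenvector
# similarity: compare the Laplacian spectra of k-NN graphs built on the two
# embedding spaces. Illustrative; not necessarily the measure IsoVec optimises.
import numpy as np

def knn_graph_laplacian(emb: np.ndarray, k: int = 10) -> np.ndarray:
    # Cosine-similarity k-NN graph (symmetrised), then unnormalised Laplacian.
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)
    A = np.zeros_like(sim)
    rows = np.arange(len(sim))[:, None]
    A[rows, np.argsort(-sim, axis=1)[:, :k]] = 1.0
    A = np.maximum(A, A.T)                         # symmetrise
    return np.diag(A.sum(axis=1)) - A

def eigenvector_similarity(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 10) -> float:
    ev_a = np.sort(np.linalg.eigvalsh(knn_graph_laplacian(emb_a, k)))[::-1]
    ev_b = np.sort(np.linalg.eigvalsh(knn_graph_laplacian(emb_b, k)))[::-1]
    m = min(len(ev_a), len(ev_b))
    # Lower is "more isomorphic": sum of squared differences of the top eigenvalues.
    return float(np.sum((ev_a[:m] - ev_b[:m]) ** 2))

# Toy usage on the most frequent words of two (hypothetical) embedding spaces.
rng = np.random.default_rng(0)
A_space = rng.normal(size=(200, 50))
B_space = A_space @ np.linalg.qr(rng.normal(size=(50, 50)))[0]  # rotated copy
print(eigenvector_similarity(A_space, B_space))                 # ~0: the spaces are isomorphic
```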
- Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora [54.757845511368814]
The problem of comparing two bodies of text and searching for words that differ in their usage arises often in digital humanities and computational social science.
This is commonly approached by training word embeddings on each corpus, aligning the vector spaces, and looking for words whose cosine distance in the aligned space is large.
We propose an alternative approach that does not use vector space alignment, and instead considers the neighbors of each word.
arXiv Detail & Related papers (2021-12-28T23:46:00Z)
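A minimal sketch of the neighbour-based idea: score each shared word by how little its nearest-neighbour set overlaps across the two corpora, with no vector-space alignment step (variable names and k are illustrative).

```python
# Sketch of a neighbour-overlap score for usage change, as an illustration of
# the alignment-free idea described above.
import numpy as np

def _neighbor_sets(emb: np.ndarray, k: int) -> list[set[int]]:
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)                 # exclude the word itself
    top = np.argsort(-sim, axis=1)[:, :k]
    return [set(row.tolist()) for row in top]

def usage_change_scores(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 50) -> np.ndarray:
    """emb_a, emb_b: embeddings for the SAME shared vocabulary, trained
    independently on the two corpora. Higher score = smaller neighbour
    overlap, i.e. a stronger usage-change candidate."""
    nn_a = _neighbor_sets(emb_a, k)
    nn_b = _neighbor_sets(emb_b, k)
    return np.array([k - len(a & b) for a, b in zip(nn_a, nn_b)])

# Usage: rank the shared vocabulary and inspect the highest-scoring words.
# scores = usage_change_scores(corpus1_vectors, corpus2_vectors)
# candidates = np.argsort(-scores)[:20]
```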
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
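A small sketch of a bitext-retrieval-style alignment measure of the kind described above: cosine nearest-neighbour search over encoded translation pairs, scored by precision@1 (placeholder inputs; the paper's exact retrieval setup may differ).

```python
# Sketch of a task-based alignment measure: bitext retrieval precision@1 over a
# parallel corpus encoded by a multilingual model. Inputs are placeholders.
import numpy as np

def bitext_retrieval_p_at_1(src: np.ndarray, tgt: np.ndarray) -> float:
    """src[i] and tgt[i] encode a translation pair; the score is the fraction of
    source sentences whose cosine-nearest target is the true translation."""
    S = src / np.linalg.norm(src, axis=1, keepdims=True)
    T = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    predictions = np.argmax(S @ T.T, axis=1)
    return float(np.mean(predictions == np.arange(len(src))))

# Usage: encode a held-out bitext with the multilingual model under study.
# p1 = bitext_retrieval_p_at_1(encode(english_sents), encode(german_sents))
```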
- Word Embedding Transformation for Robust Unsupervised Bilingual Lexicon Induction [21.782189001319935]
We propose a transformation-based method to increase the isomorphism of embeddings of two languages.
Our approach can achieve competitive or superior performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2021-05-26T02:09:58Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
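Approach (ii) above has a direct implementation: standardise each language's representations with its own mean and standard deviation. A short sketch with hypothetical encoder outputs:

```python
# Sketch of approach (ii): remove each language's mean and variance from its
# representations so that per-language distributional differences cancel out.
import numpy as np

def remove_language_stats(emb_by_lang: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Standardise every language's embedding matrix with its own statistics."""
    out = {}
    for lang, X in emb_by_lang.items():
        mu = X.mean(axis=0, keepdims=True)
        sigma = X.std(axis=0, keepdims=True) + 1e-8   # avoid division by zero
        out[lang] = (X - mu) / sigma
    return out

# Usage with hypothetical encoder outputs:
# normalised = remove_language_stats({"en": enc(en_sents), "de": enc(de_sents)})
```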
- LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space [17.49073364781107]
We propose a novel semi-supervised method to learn cross-lingual word embeddings for bilingual lexicon induction.
Our model is independent of the isomorphic assumption and uses nonlinear mapping in the latent space of two independently trained auto-encoders.
arXiv Detail & Related papers (2020-04-28T23:28:26Z)
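A compact sketch of the general recipe described above: two per-language autoencoders plus a small non-linear mapper between their latent spaces, trained on seed translation pairs. The architecture and losses here are illustrative, not the paper's exact design.

```python
# Compact sketch: two per-language autoencoders and a non-linear mapper between
# their latent spaces, fit on a seed dictionary. Illustrative architecture only.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, dim: int, latent: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, latent), nn.ReLU(), nn.Linear(latent, latent))
        self.dec = nn.Sequential(nn.Linear(latent, latent), nn.ReLU(), nn.Linear(latent, dim))

    def forward(self, x):
        z = self.enc(x)
        return z, self.dec(z)

def train_latent_mapper(ae_src, ae_tgt, seed_src, seed_tgt, steps=500, lr=1e-3):
    """Learn a non-linear map from the source AE's latent space to the target
    AE's latent space using seed translation pairs (autoencoders kept frozen)."""
    latent = ae_src.enc[-1].out_features
    mapper = nn.Sequential(nn.Linear(latent, latent), nn.ReLU(), nn.Linear(latent, latent))
    opt = torch.optim.Adam(mapper.parameters(), lr=lr)
    with torch.no_grad():
        z_src, _ = ae_src(seed_src)
        z_tgt, _ = ae_tgt(seed_tgt)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(mapper(z_src), z_tgt)
        loss.backward()
        opt.step()
    return mapper

# Induction then encodes a source word, maps its latent code, and retrieves the
# nearest target word in the target latent space.
```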
- Refinement of Unsupervised Cross-Lingual Word Embeddings [2.4366811507669124]
Cross-lingual word embeddings aim to bridge the gap between high-resource and low-resource languages.
We propose a self-supervised method to refine the alignment of unsupervised bilingual word embeddings.
arXiv Detail & Related papers (2020-02-21T10:39:53Z)
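Refinement in this setting is often implemented as iterative Procrustes over mutual nearest neighbours; the sketch below shows that generic loop as an illustration, not necessarily this paper's specific self-supervised procedure.

```python
# Generic refinement loop of the kind this line of work builds on (illustrative):
# repeatedly extract mutual nearest neighbours under the current mapping as a
# pseudo-dictionary, then re-fit an orthogonal map on those pairs.
import numpy as np

def refine_alignment(X: np.ndarray, Y: np.ndarray, W: np.ndarray, iters: int = 5) -> np.ndarray:
    """X, Y: full length-normalised source/target embedding matrices; W: initial
    mapping from an unsupervised method. Returns the refined mapping."""
    for _ in range(iters):
        sims = (X @ W) @ Y.T
        fwd = sims.argmax(axis=1)                  # best target for each source word
        bwd = sims.argmax(axis=0)                  # best source for each target word
        src_idx = np.where(bwd[fwd] == np.arange(len(X)))[0]  # mutual nearest neighbours
        tgt_idx = fwd[src_idx]
        U, _, Vt = np.linalg.svd(X[src_idx].T @ Y[tgt_idx])   # Procrustes re-fit
        W = U @ Vt
    return W

# Usage: W_refined = refine_alignment(X_norm, Y_norm, W_unsupervised)
```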