Cross-lingual Word Embeddings in Hyperbolic Space
- URL: http://arxiv.org/abs/2205.01907v1
- Date: Wed, 4 May 2022 06:15:37 GMT
- Title: Cross-lingual Word Embeddings in Hyperbolic Space
- Authors: Chandni Saxena, Mudit Chaudhary, Helen Meng
- Abstract summary: Cross-lingual word embeddings can be applied to several natural language processing applications across multiple languages.
This paper presents a simple and effective cross-lingual Word2Vec model that adapts to the Poincaré ball model of hyperbolic space.
- Score: 31.888489552069146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual word embeddings can be applied to several natural language
processing applications across multiple languages. Unlike prior works that use
word embeddings based on the Euclidean space, this short paper presents a
simple and effective cross-lingual Word2Vec model that adapts to the Poincaré
ball model of hyperbolic space to learn unsupervised cross-lingual word
representations from a German-English parallel corpus. It has been shown that
hyperbolic embeddings can capture and preserve hierarchical relationships. We
evaluate the model on both hypernymy and analogy tasks. The proposed model
achieves performance comparable to the vanilla Word2Vec model on the
cross-lingual analogy task, while the hypernymy task shows that the
cross-lingual Poincaré Word2Vec model can capture latent hierarchical
structure from free text across languages, which is absent from the
Euclidean-based Word2Vec representations. Our results show that by
preserving the latent hierarchical
information, hyperbolic spaces can offer better representations for
cross-lingual embeddings.
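The core change in such an adaptation is replacing Word2Vec's Euclidean dot-product similarity with the geodesic distance of the Poincaré ball, and rescaling gradients for Riemannian optimization. Below is a minimal sketch of those two pieces, assuming the standard Poincaré distance and Riemannian gradient scaling from Nickel and Kiela's Poincaré embeddings; the function names and the projection step are illustrative, not taken from this paper.

```python
import numpy as np

EPS = 1e-9  # numerical guard so points stay strictly inside the unit ball

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points of the open unit ball:
    d(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2))).
    """
    sq_u = u @ u
    sq_v = v @ v
    sq_diff = (u - v) @ (u - v)
    denom = max((1.0 - sq_u) * (1.0 - sq_v), EPS)
    return float(np.arccosh(1.0 + 2.0 * sq_diff / denom))

def riemannian_grad(theta: np.ndarray, euclid_grad: np.ndarray) -> np.ndarray:
    """Rescale a Euclidean gradient to the Riemannian gradient on the ball."""
    scale = ((1.0 - theta @ theta) ** 2) / 4.0
    return scale * euclid_grad

def project_to_ball(theta: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Retract a point back inside the unit ball after a gradient step."""
    norm = np.linalg.norm(theta)
    return theta if norm < 1.0 else theta / (norm + eps)

# Distances blow up near the boundary, which is what lets the ball encode
# hierarchy: general terms can sit near the origin, specific terms near the rim.
origin = np.zeros(2)
leaf = np.array([0.95, 0.0])
print(poincare_distance(origin, leaf))  # ~3.66, despite a small Euclidean gap
```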
Related papers
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z) - Exploring Alignment in Shared Cross-lingual Spaces [15.98134426166435]
We employ clustering to uncover latent concepts within multilingual models.
Our analysis focuses on quantifying the alignment and overlap of these concepts across various languages.
Our study encompasses three multilingual models (mT5, mBERT, and XLM-R) and three downstream tasks (Machine Translation, Named Entity Recognition, and Sentiment Analysis).
arXiv Detail & Related papers (2024-05-23T13:20:24Z) - Tokenization Impacts Multilingual Language Modeling: Assessing
Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that vocabulary overlap across languages can actually be detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z) - VECO 2.0: Cross-lingual Language Model Pre-training with
Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model, VECO 2.0, based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs (a minimal sketch of this kind of objective appears after this list).
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - Robust Unsupervised Cross-Lingual Word Embedding using Domain Flow
Interpolation [48.32604585839687]
Previous adversarial approaches have shown promising results in inducing cross-lingual word embeddings without parallel data.
We propose to make use of a sequence of intermediate spaces for smooth bridging.
arXiv Detail & Related papers (2022-10-07T04:37:47Z) - Lightweight Cross-Lingual Sentence Representation Learning [57.9365829513914]
We introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations.
We propose a novel cross-lingual language model, which combines the existing single-word masked language model with the newly proposed cross-lingual token-level reconstruction task.
arXiv Detail & Related papers (2021-05-28T14:10:48Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Improving Multilingual Models with Language-Clustered Vocabularies [8.587129426070979]
We introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters.
Our experiments show improvements across languages on key multilingual benchmark tasks.
arXiv Detail & Related papers (2020-10-24T04:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.