Unsupervised Word Translation Pairing using Refinement based Point Set
Registration
- URL: http://arxiv.org/abs/2011.13200v1
- Date: Thu, 26 Nov 2020 09:51:29 GMT
- Title: Unsupervised Word Translation Pairing using Refinement based Point Set
Registration
- Authors: Silviu Oprea and Sourav Dutta and Haytham Assem
- Abstract summary: Cross-lingual alignment of word embeddings plays an important role in knowledge transfer across languages.
Current unsupervised approaches rely on similarities in geometric structure of word embedding spaces across languages.
This paper proposes BioSpere, a novel framework for unsupervised mapping of bilingual word embeddings onto a shared vector space.
- Score: 8.568050813210823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual alignment of word embeddings plays an important role in
knowledge transfer across languages, improving machine translation and
other multi-lingual applications. Current unsupervised approaches rely on
similarities in the geometric structure of word embedding spaces across languages
to learn structure-preserving linear transformations, using adversarial networks
and refinement strategies. In practice, however, such techniques tend to
suffer from instability and convergence issues, requiring tedious fine-tuning
for precise parameter settings. This paper proposes BioSpere, a novel framework
for unsupervised mapping of bilingual word embeddings onto a shared vector
space, which combines adversarial initialization and a refinement procedure with
a point set registration algorithm from image processing. We show that our
framework alleviates the shortcomings of existing methodologies and is
relatively invariant to variable adversarial learning performance, demonstrating
robustness in terms of parameter choices and training losses. Experimental
evaluation on the parallel dictionary induction task demonstrates state-of-the-art
results for our framework on diverse language pairs.
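The refinement step in pipelines of this kind is commonly realized as iterative orthogonal Procrustes fitting over an induced dictionary. The sketch below illustrates that generic idea only; it is not BioSpere's actual algorithm (which additionally employs a point set registration algorithm from image processing), and the function names, the nearest-neighbour induction step, and the iteration count are illustrative assumptions.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map W minimizing ||X @ W - Y||_F (Schoenemann's solution)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def refine(X, Y, n_iter=5):
    """Alternate nearest-neighbour dictionary induction with Procrustes re-fitting.

    X, Y: row-normalized source/target embedding matrices of equal dimension.
    """
    W = np.eye(X.shape[1])
    for _ in range(n_iter):
        # induce a pseudo-dictionary: each source word's nearest target neighbour
        tgt = ((X @ W) @ Y.T).argmax(axis=1)
        # re-fit the orthogonal map on the induced pairs
        W = procrustes(X, Y[tgt])
    return W
```

In practice the loop would be seeded with an adversarially learned map rather than the identity, and a hubness-aware retrieval criterion such as CSLS is often preferred over plain nearest neighbours when inducing the dictionary.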
Related papers
- Pointer-Guided Pre-Training: Infusing Large Language Models with Paragraph-Level Contextual Awareness [3.2925222641796554]
"Pointer-guided segment ordering" (SO) is a novel pre-training technique aimed at enhancing the contextual understanding of paragraph-level text representations.
Our experiments show that pointer-guided pre-training significantly enhances the model's ability to understand complex document structures.
arXiv Detail & Related papers (2024-06-06T15:17:51Z) - An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z) - Cross-domain Chinese Sentence Pattern Parsing [67.1381983012038]
Sentence Pattern Structure (SPS) parsing is a syntactic analysis method primarily employed in language teaching.
Existing SPS parsers rely heavily on textbook corpora for training, lacking cross-domain capability.
This paper proposes an innovative approach leveraging large language models (LLMs) within a self-training framework.
arXiv Detail & Related papers (2024-02-26T05:30:48Z) - Idioms, Probing and Dangerous Things: Towards Structural Probing for
Idiomaticity in Vector Space [2.5288257442251107]
The goal of this paper is to learn more about how idiomatic information is structurally encoded in embeddings.
We perform a comparative probing study of static (GloVe) and contextual (BERT) embeddings.
Our experiments indicate that both encode some idiomatic information to varying degrees, but yield conflicting evidence as to whether idiomaticity is encoded in the vector norm.
arXiv Detail & Related papers (2023-04-27T17:06:20Z) - Robust Unsupervised Cross-Lingual Word Embedding using Domain Flow
Interpolation [48.32604585839687]
Previous adversarial approaches have shown promising results in inducing cross-lingual word embeddings without parallel data.
We propose to make use of a sequence of intermediate spaces for smooth bridging.
arXiv Detail & Related papers (2022-10-07T04:37:47Z) - Multilingual Extraction and Categorization of Lexical Collocations with
Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z) - Unsupervised Alignment of Distributional Word Embeddings [0.0]
Cross-domain alignment plays a key role in tasks ranging from machine translation to transfer learning.
We show that the proposed approach achieves good performance on the bilingual lexicon induction task across several language pairs.
arXiv Detail & Related papers (2022-03-09T16:39:06Z) - Zero-Shot Cross-Lingual Dependency Parsing through Contextual Embedding
Transformation [7.615096161060399]
Cross-lingual embedding space mapping is usually studied in static word-level embeddings.
We investigate a contextual embedding alignment approach which is sense-level and dictionary-free.
Experiments on zero-shot dependency parsing through the concept-shared space built by our embedding transformation substantially outperform state-of-the-art methods using multilingual embeddings.
arXiv Detail & Related papers (2021-03-03T06:50:43Z) - Multilingual Alignment of Contextual Word Representations [49.42244463346612]
After alignment, multilingual BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model.
We introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer.
These results support contextual alignment as a useful concept for understanding large multilingual pre-trained models.
arXiv Detail & Related papers (2020-02-10T03:27:21Z) - Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.