A Generalized Constraint Approach to Bilingual Dictionary Induction for
Low-Resource Language Families
- URL: http://arxiv.org/abs/2010.02395v1
- Date: Mon, 5 Oct 2020 23:41:04 GMT
- Authors: Arbi Haza Nasution, Yohei Murakami, Toru Ishida
- Abstract summary: We propose constraint-based bilingual lexicon induction for closely-related languages.
We identify cognate synonyms to obtain many-to-many translation pairs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The lack or absence of parallel and comparable corpora makes bilingual
lexicon extraction a difficult task for low-resource languages. The pivot
language and cognate recognition approaches have been proven useful for
inducing bilingual lexicons for such languages. We propose constraint-based
bilingual lexicon induction for closely-related languages by extending
constraints from the recent pivot-based induction technique and further
enabling multiple symmetry assumption cycles to reach many more cognates in the
transgraph. We further identify cognate synonyms to obtain many-to-many
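The pivot-based induction over a transgraph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy word pairs and the single-pass `induce_pairs` function are made up, and the paper's full method adds constraints, cognate thresholds, and repeated symmetry assumption cycles on top of this idea.

```python
from collections import defaultdict

# Toy input dictionaries (illustrative data, not the paper's datasets):
# source-pivot and pivot-target translation pairs.
src_pivot = [("air", "water"), ("ayer", "water")]
pivot_tgt = [("water", "acqua")]

def build_transgraph(src_pivot, pivot_tgt):
    """Build the transgraph: source and target words are connected
    only through shared pivot-language words."""
    g = defaultdict(set)
    for s, p in src_pivot:
        g[("src", s)].add(("piv", p))
        g[("piv", p)].add(("src", s))
    for p, t in pivot_tgt:
        g[("piv", p)].add(("tgt", t))
        g[("tgt", t)].add(("piv", p))
    return g

def induce_pairs(g):
    """Hypothesise (source, target) translation pairs for words that
    share a pivot. The paper repeats such passes (symmetry assumption
    cycles), feeding hypothesised edges back into the transgraph to
    reach further cognates; a single pass is shown here."""
    pairs = set()
    for node in [n for n in g if n[0] == "piv"]:
        srcs = [w for kind, w in g[node] if kind == "src"]
        tgts = [w for kind, w in g[node] if kind == "tgt"]
        for s in srcs:
            for t in tgts:
                pairs.add((s, t))
    return pairs

g = build_transgraph(src_pivot, pivot_tgt)
pairs = induce_pairs(g)
# pairs == {("air", "acqua"), ("ayer", "acqua")}
```

Note how the shared pivot "water" lets both source-language spelling variants reach the same target word, which is exactly where cognate recognition between closely-related languages pays off.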
translation pairs. This paper utilizes four datasets: one Austronesian
low-resource language and three Indo-European high-resource languages. We use
three constraint-based methods from our previous work, the Inverse Consultation
method and translation pairs generated from the Cartesian product of input
dictionaries as baselines. We evaluate our result using the metrics of
precision, recall and F-score. Our customizable approach allows the user to
conduct cross-validation to predict the optimal hyperparameters (cognate
threshold and cognate synonym threshold) with various combinations of
heuristics and the number of symmetry assumption cycles to gain the highest
F-score. Our proposed methods yield statistically significant improvements in
precision and F-score over our previous constraint-based methods. The results
indicate that our method has the potential to complement other bilingual
dictionary creation methods, such as word-alignment models trained on parallel
corpora for high-resource languages, while handling low-resource languages well.
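The evaluation described in the abstract can be sketched as below. The word pairs, the threshold grid, and the `induce` callback are hypothetical placeholders; only the precision/recall/F-score definitions and the grid-search-for-highest-F idea come from the abstract itself.

```python
from itertools import product

def evaluate(induced, gold):
    """Precision, recall and F-score of induced translation pairs
    measured against a gold-standard dictionary."""
    induced, gold = set(induced), set(gold)
    tp = len(induced & gold)  # true positives: pairs found in both
    p = tp / len(induced) if induced else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def best_hyperparams(induce, gold, grid=(0.5, 0.6, 0.7, 0.8)):
    """Grid-search the two thresholds from the abstract (cognate
    threshold, cognate-synonym threshold) for the highest F-score.
    `induce` is a hypothetical function mapping thresholds to pairs."""
    return max(product(grid, grid),
               key=lambda ts: evaluate(induce(*ts), gold)[2])

# Toy induced/gold pairs (illustrative only).
induced = [("air", "acqua"), ("api", "fuoco"), ("batu", "sasso")]
gold = [("air", "acqua"), ("api", "fuoco"), ("kayu", "legno")]
p, r, f = evaluate(induced, gold)
# p == r == f == 2/3: two of three induced pairs are correct,
# and two of three gold pairs are recovered.
```

In the paper this selection is done by cross-validation over held-out dictionary entries rather than a single gold set.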
Related papers
- Improving Multi-lingual Alignment Through Soft Contrastive Learning [9.454626745893798]
We propose a novel method to align multi-lingual embeddings based on the similarity of sentences measured by a pre-trained mono-lingual embedding model.
Given translation sentence pairs, we train a multi-lingual model in a way that the similarity between cross-lingual embeddings follows the similarity of sentences measured at the mono-lingual teacher model.
arXiv Detail & Related papers (2024-05-25T09:46:07Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
- Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation [14.116412358534442]
Methods for improving neural machine translation for low-resource languages are reviewed.
Tests are carried out on three artificially restricted translation tasks and one real-world task.
Experiments show positive effects especially for scheduled multi-task learning, denoising autoencoder, and subword sampling.
arXiv Detail & Related papers (2020-04-08T14:19:05Z)
- Refinement of Unsupervised Cross-Lingual Word Embeddings [2.4366811507669124]
Cross-lingual word embeddings aim to bridge the gap between high-resource and low-resource languages.
We propose a self-supervised method to refine the alignment of unsupervised bilingual word embeddings.
arXiv Detail & Related papers (2020-02-21T10:39:53Z) - Robust Cross-lingual Embeddings from Parallel Sentences [65.85468628136927]
We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word representations.
Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches.
It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task.
arXiv Detail & Related papers (2019-12-28T16:18:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.