Low-resource Bilingual Dialect Lexicon Induction with Large Language Models
- URL: http://arxiv.org/abs/2304.09957v1
- Date: Wed, 19 Apr 2023 20:20:41 GMT
- Title: Low-resource Bilingual Dialect Lexicon Induction with Large Language Models
- Authors: Ekaterina Artemova and Barbara Plank
- Abstract summary: We present an analysis of the bilingual lexicon induction pipeline for German and two of its dialects, Bavarian and Alemannic.
This setup poses several challenges, including the scarcity of resources, the relatedness of the languages, and the lack of standardization in the orthography of dialects.
- Score: 24.080565202390314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bilingual word lexicons are crucial tools for multilingual natural language
understanding and machine translation tasks, as they facilitate the mapping of
words in one language to their synonyms in another language. To achieve this,
numerous papers have explored bilingual lexicon induction (BLI) in
high-resource scenarios, using a typical pipeline consisting of two
unsupervised steps: bitext mining and word alignment, both of which rely on
pre-trained large language models (LLMs).
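As a rough illustration of this two-step pipeline, the sketch below mines sentence pairs with a multilingual sentence encoder and then aligns words inside each mined pair. The LaBSE checkpoint, the greedy cosine-threshold mining, and the whitespace tokenization are illustrative assumptions, not the setup used in the paper.

```python
# Minimal sketch of an unsupervised BLI pipeline:
# (1) bitext mining with a multilingual sentence encoder, then
# (2) word alignment inside each mined pair. The checkpoint, the
# greedy mining, and the whitespace tokenizer are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def mine_bitext(dialect_sents, german_sents, threshold=0.7):
    """Greedily pair each dialect sentence with its nearest German
    sentence by cosine similarity, keeping pairs above a threshold."""
    src = model.encode(dialect_sents, normalize_embeddings=True)
    tgt = model.encode(german_sents, normalize_embeddings=True)
    sims = src @ tgt.T  # cosine similarity (embeddings are normalized)
    pairs = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            pairs.append((dialect_sents[i], german_sents[j], float(row[j])))
    return pairs

def align_words(src_sent, tgt_sent):
    """Toy word alignment: embed each whitespace token independently
    and link mutual nearest neighbors (argmax in both directions)."""
    src_toks, tgt_toks = src_sent.split(), tgt_sent.split()
    src_vecs = model.encode(src_toks, normalize_embeddings=True)
    tgt_vecs = model.encode(tgt_toks, normalize_embeddings=True)
    sims = src_vecs @ tgt_vecs.T
    links = []
    for i in range(len(src_toks)):
        j = int(np.argmax(sims[i]))
        if int(np.argmax(sims[:, j])) == i:  # keep mutual best matches only
            links.append((src_toks[i], tgt_toks[j]))
    return links

# Hypothetical Bavarian-German example sentences.
for src, tgt, score in mine_bitext(["I mog di"], ["Ich mag dich"]):
    print(f"{score:.2f}", align_words(src, tgt))
```

The word pairs linked in step (2), aggregated over the mined bitext, would form the induced lexicon.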
In this paper, we present an analysis of the BLI pipeline for German and two
of its dialects, Bavarian and Alemannic. This setup poses several unique
challenges, including the scarcity of resources, the relatedness of the
languages, and the lack of standardization in the orthography of dialects. To
evaluate the BLI outputs, we analyze them with respect to word frequency and
pairwise edit distance. Additionally, we release two evaluation datasets
comprising 1,500 bilingual sentence pairs and 1,000 bilingual word pairs,
manually judged for semantic similarity, for each of the Bavarian-German and
Alemannic-German language pairs.
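The two analyses mentioned above are easy to prototype. The sketch below computes a length-normalized Levenshtein distance and a corpus frequency for a few hypothetical Bavarian-German word pairs; the normalization by the longer word and the toy data are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of the two analyses over induced word pairs:
# word frequency and pairwise edit distance.
from collections import Counter

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer word's length."""
    return edit_distance(a, b) / max(len(a), len(b), 1)

# Hypothetical induced Bavarian-German pairs and a toy German corpus.
pairs = [("Diandl", "Maedchen"), ("gwen", "gewesen"), ("Haus", "Haus")]
corpus_freq = Counter(["Haus", "Haus", "Maedchen", "gewesen"])

for dialect, german in pairs:
    print(dialect, german,
          f"NED={normalized_edit_distance(dialect, german):.2f}",
          f"freq(de)={corpus_freq[german]}")
```

Identical cognates score a normalized edit distance of 0, so a high proportion of low-distance pairs would signal that the pipeline mostly recovers orthographically similar words rather than genuinely divergent dialect vocabulary.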
Related papers
- Multilingual Sentence Transformer as A Multilingual Word Aligner [15.689680887384847]
We investigate whether the multilingual sentence Transformer LaBSE is a strong multilingual word aligner.
Experiment results on seven language pairs show that our best aligner outperforms previous state-of-the-art models of all varieties.
Our aligner supports different language pairs in a single model, and even achieves new state-of-the-art on zero-shot language pairs that do not appear in the fine-tuning process.
arXiv Detail & Related papers (2023-01-28T09:28:55Z)
- Massively Multilingual Lexical Specialization of Multilingual Transformers [18.766379322798837]
We show that massively multilingual lexical specialization brings substantial gains in two standard cross-lingual lexical tasks.
We observe gains for languages unseen in specialization, indicating that multilingual lexical specialization enables generalization to languages with no lexical constraints.
arXiv Detail & Related papers (2022-08-01T17:47:03Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned into bilingual word embedding (BWE) spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building BWEs in which the vector space of the high-resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z)
- Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
However, existing UNMT systems can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
- Investigating Language Impact in Bilingual Approaches for Computational Language Documentation [28.838960956506018]
This paper investigates how the choice of translation language affects subsequent documentation work.
We create 56 bilingual pairs that we apply to the task of low-resource unsupervised word segmentation and alignment.
Our results suggest that incorporating clues into the neural models' input representation increases their translation and alignment quality.
arXiv Detail & Related papers (2020-03-30T10:30:34Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.