Improving Rare Word Translation With Dictionaries and Attention Masking
- URL: http://arxiv.org/abs/2408.09075v2
- Date: Tue, 3 Sep 2024 16:47:09 GMT
- Title: Improving Rare Word Translation With Dictionaries and Attention Masking
- Authors: Kenneth J. Sible, David Chiang
- Abstract summary: We propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions.
We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.
- Score: 8.908747084128397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.
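The idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the exact masking pattern, so the scheme below is an assumption made for illustration: source tokens attend to each other, a rare word and its appended definition attend to each other, and definition tokens are otherwise hidden from the rest of the sentence. The function names `append_definitions` and `build_attention_mask`, the toy dictionary, and the rare-word set are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Set, Tuple


def append_definitions(
    source: List[str],
    dictionary: Dict[str, List[str]],
    rare_words: Set[str],
) -> Tuple[List[str], List[Optional[int]]]:
    """Append dictionary definitions of rare words to the source tokens.

    Returns the extended token list and an `owner` list: owner[i] is None
    for an original source token, or the index of the rare word that
    definition token i belongs to.
    """
    tokens = list(source)
    owner: List[Optional[int]] = [None] * len(source)
    for i, word in enumerate(source):
        if word in rare_words and word in dictionary:
            for def_token in dictionary[word]:
                tokens.append(def_token)
                owner.append(i)
    return tokens, owner


def build_attention_mask(owner: List[Optional[int]]) -> List[List[bool]]:
    """Build a boolean mask where mask[q][k] is True if q may attend to k.

    Assumed scheme (not confirmed by the source):
      - source tokens attend to all source tokens;
      - a rare word and the tokens of its own definition attend to each other;
      - tokens of the same definition attend to each other;
      - every other pair is masked out.
    """
    n = len(owner)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if owner[q] is None and owner[k] is None:
                mask[q][k] = True
            elif owner[q] is None:
                mask[q][k] = owner[k] == q
            elif owner[k] is None:
                mask[q][k] = owner[q] == k
            else:
                mask[q][k] = owner[q] == owner[k]
    return mask


# Toy example: "quokka" is treated as rare and has a two-token definition.
tokens, owner = append_definitions(
    ["the", "quokka", "slept"],
    {"quokka": ["small", "wallaby"]},
    {"quokka"},
)
mask = build_attention_mask(owner)
```

In a real encoder the boolean mask would typically be converted to additive values (zero for allowed pairs, a large negative number for masked pairs) before the softmax; that step is omitted here.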
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z) - Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment [50.80949663719335]
Training for cross-lingual alignment of sentence embeddings distorts the optimal monolingual structure of semantic spaces of individual languages.
We train language-specific sentence encoders to avoid negative interference between languages.
We then align all non-English monolingual encoders to the English encoder by training a cross-lingual alignment adapter on top of each.
arXiv Detail & Related papers (2024-07-20T13:56:39Z) - Beyond Shared Vocabulary: Increasing Representational Word Similarities across Languages for Multilingual Machine Translation [9.794506112999823]
In this paper, we define word-level information transfer pathways via word equivalence classes and rely on graph networks to fuse word embeddings across languages.
Our experiments demonstrate the advantages of our approach: 1) embeddings of words with similar meanings are better aligned across languages, 2) our method achieves consistent BLEU improvements of up to 2.3 points for high- and low-resource MNMT, and 3) less than 1.0% additional trainable parameters are required with a limited increase in computational costs.
arXiv Detail & Related papers (2023-05-23T16:11:00Z) - Automatically Creating a Large Number of New Bilingual Dictionaries [2.363388546004777]
This paper proposes approaches to automatically create a large number of new bilingual dictionaries for low-resource languages.
Our algorithms produce translations of words in a source language into many target languages using available Wordnets and a machine translator.
arXiv Detail & Related papers (2022-08-12T04:25:23Z) - Creating Lexical Resources for Endangered Languages [2.363388546004777]
Our algorithms construct bilingual dictionaries and multilingual thesauruses using public Wordnets and a machine translator (MT).
Since our work relies on only one bilingual dictionary between an endangered language and an "intermediate helper" language, it is applicable to languages that lack many existing resources.
arXiv Detail & Related papers (2022-08-08T02:31:28Z) - BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z) - Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment [49.3253280592705]
We show it is possible to produce much higher quality lexicons with methods that combine bitext mining and unsupervised word alignment.
Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs.
arXiv Detail & Related papers (2021-01-01T03:12:42Z) - Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation [1.1470070927586016]
Out-of-vocabulary (OOV) words are a problem for machine translation (MT) in low-resource languages.
This paper focuses on data augmentation techniques where bilingual lexicon terms are expanded based on case-markers.
arXiv Detail & Related papers (2020-11-05T13:58:32Z) - Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building BWEs in which the vector space of the high resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z) - Look It Up: Bilingual Dictionaries Improve Neural Machine Translation [17.385945558427863]
We describe a new method for "attaching" dictionary definitions to rare words so that the network can learn the best way to use them.
We demonstrate improvements of up to 1.8 BLEU using bilingual dictionaries.
arXiv Detail & Related papers (2020-10-12T19:53:08Z) - Speakers Fill Lexical Semantic Gaps with Context [65.08205006886591]
We operationalise the lexical ambiguity of a word as the entropy of meanings it can take.
We find significant correlations between our estimate of ambiguity and the number of synonyms a word has in WordNet.
This suggests that, in the presence of ambiguity, speakers compensate by making contexts more informative.
arXiv Detail & Related papers (2020-10-05T17:19:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.