Facilitating Terminology Translation with Target Lemma Annotations
- URL: http://arxiv.org/abs/2101.10035v1
- Date: Mon, 25 Jan 2021 12:07:20 GMT
- Title: Facilitating Terminology Translation with Target Lemma Annotations
- Authors: Toms Bergmanis and Mārcis Pinnis
- Abstract summary: We train machine translation systems using a source-side data augmentation method that annotates randomly selected source language words with their target language lemmas.
Experiments on terminology translation into the morphologically complex Baltic and Uralic languages show an improvement of up to 7 BLEU points over baseline systems.
Results of the human evaluation indicate a 47.7% absolute improvement over the previous work in term translation accuracy when translating into Latvian.
- Score: 4.492630871726495
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Most of the recent work on terminology integration in machine translation has
assumed that terminology translations are given already inflected in forms that
are suitable for the target language sentence. In day-to-day work of
professional translators, however, it is seldom the case as translators work
with bilingual glossaries where terms are given in their dictionary forms;
finding the right target language form is part of the translation process. We
argue that the requirement for a priori specified target language forms is
unrealistic and impedes the practical applicability of previous work. In this
work, we propose to train machine translation systems using a source-side data
augmentation method that annotates randomly selected source language words with
their target language lemmas. We show that systems trained on such augmented
data are readily usable for terminology integration in real-life translation
scenarios. Our experiments on terminology translation into the morphologically
complex Baltic and Uralic languages show an improvement of up to 7 BLEU points
over baseline systems with no means for terminology integration and an average
improvement of 4 BLEU points over the previous work. Results of the human
evaluation indicate a 47.7% absolute improvement over the previous work in term
translation accuracy when translating into Latvian.
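The augmentation method described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the annotation markers (`<lemma>`, `</lemma>`), the sampling probability, and the function names are assumptions chosen for clarity; the paper's actual inline format and selection procedure may differ.

```python
import random

def annotate_with_target_lemmas(source_tokens, lemma_dict, p=0.1, rng=None):
    """Augment a source sentence by annotating randomly selected source
    words with their target-language lemmas, so the trained model learns
    to copy and inflect glossary terms given in dictionary form.
    The inline marker format here is hypothetical."""
    rng = rng or random.Random(0)
    out = []
    for tok in source_tokens:
        lemma = lemma_dict.get(tok.lower())
        if lemma is not None and rng.random() < p:
            # Insert the target lemma right after the source token,
            # wrapped in special tokens the NMT model can learn to use.
            out.extend([tok, "<lemma>", lemma, "</lemma>"])
        else:
            out.append(tok)
    return out

# Example: an English-Latvian glossary with terms in dictionary form.
glossary = {"terminology": "terminoloģija", "translation": "tulkojums"}
src = "terminology translation is hard".split()
print(annotate_with_target_lemmas(src, glossary, p=1.0))
```

At inference time the same annotation scheme lets a translator inject glossary terms in their dictionary (lemma) form; producing the correctly inflected target form is left to the model, which is the point of training on such augmented data.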
Related papers
- Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
arXiv Detail & Related papers (2023-10-10T23:47:25Z)
- Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual Translation of Dravidian Languages [0.34998703934432673]
We build a single-decoder neural machine translation system for Dravidian-Dravidian multilingual translation.
Our model achieves scores within 3 BLEU of large-scale pivot-based models when it is trained on 50% of the language directions.
arXiv Detail & Related papers (2023-08-10T13:38:09Z)
- The Best of Both Worlds: Combining Human and Machine Translations for Multilingual Semantic Parsing with Active Learning [50.320178219081484]
We propose an active learning approach that exploits the strengths of both human and machine translations.
An ideal utterance selection can significantly reduce the error and bias in the translated data.
arXiv Detail & Related papers (2023-05-22T05:57:47Z)
- DEEP: DEnoising Entity Pre-training for Neural Machine Translation [123.6686940355937]
It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus.
We propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences.
arXiv Detail & Related papers (2021-11-14T17:28:09Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Towards Continual Learning for Multilingual Machine Translation via Vocabulary Substitution [16.939016405962526]
We propose a straightforward vocabulary adaptation scheme to extend the language capacity of multilingual machine translation models.
Our approach is suitable for large-scale datasets, applies to distant languages with unseen scripts and incurs only minor degradation on the translation performance for the original language pairs.
arXiv Detail & Related papers (2021-03-11T17:10:21Z)
- Verb Knowledge Injection for Multilingual Event Processing [50.27826310460763]
We investigate whether injecting explicit information on verbs' semantic-syntactic behaviour improves the performance of LM-pretrained Transformers.
We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction.
We then explore the utility of verb adapters for event extraction in other languages.
arXiv Detail & Related papers (2020-12-31T03:24:34Z)
- Data Augmentation and Terminology Integration for Domain-Specific Sinhala-English-Tamil Statistical Machine Translation [1.1470070927586016]
Out-of-vocabulary (OOV) words are a problem in the context of Machine Translation (MT) for low-resourced languages.
This paper focuses on data augmentation techniques where bilingual lexicon terms are expanded based on case-markers.
arXiv Detail & Related papers (2020-11-05T13:58:32Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.