Sentence Alignment with Parallel Documents Helps Biomedical Machine
Translation
- URL: http://arxiv.org/abs/2104.08588v1
- Date: Sat, 17 Apr 2021 16:09:30 GMT
- Title: Sentence Alignment with Parallel Documents Helps Biomedical Machine
Translation
- Authors: Shengxuan Luo, Huaiyuan Ying, Sheng Yu
- Abstract summary: This work presents a new unsupervised sentence alignment method and explores features in training biomedical neural machine translation (NMT) systems.
We use a simple but effective way to build bilingual word embeddings to evaluate bilingual word similarity.
The proposed method achieved high accuracy in both 1-to-1 and many-to-many cases.
- Score: 0.5430741734728369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing neural machine translation systems have achieved near
human-level performance in the general domain for some languages, but the lack
of parallel corpora remains a key problem in specific domains. In the
biomedical domain, parallel corpora are less accessible. This work presents a
new unsupervised sentence alignment method and explores features in training
biomedical neural machine translation (NMT) systems. We use a simple but
effective way to build bilingual word embeddings (BWEs) to evaluate bilingual
word similarity and transform the sentence alignment problem into an extended
earth mover's distance (EMD) problem. The proposed method achieves high
accuracy in both 1-to-1 and many-to-many cases. Pre-training in the general
domain, a larger in-domain dataset, and n-to-m sentence pairs benefit the NMT
model. Fine-tuning on an in-domain corpus helps the translation model learn
more terminology and fit the in-domain style of text.
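To make the idea of scoring sentence pairs with bilingual word embeddings and a transport-style distance concrete, here is a minimal sketch. It is not the authors' extended EMD formulation: it swaps in a relaxed lower bound on EMD (each word ships all of its mass to its nearest counterpart, in the style of the relaxed word mover's distance) and a greedy 1-to-1 matching. The toy embedding table `BWE`, the unaccented toy words, and the function names `cosine_distance`, `relaxed_emd`, and `align_one_to_one` are all hypothetical and exist only for illustration.

```python
import math

# Toy bilingual word embeddings in a shared 2-D space.
# ASSUMPTION: these vectors are invented; real BWEs would be learned.
BWE = {
    "heart": (1.0, 0.0), "coeur": (0.9, 0.1),
    "attack": (0.0, 1.0), "crise": (0.1, 0.9),
    "fever": (-1.0, 0.0), "fievre": (-0.9, 0.1),
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def relaxed_emd(src_words, tgt_words, emb=BWE):
    """Relaxed lower bound on EMD between two bags of words (uniform weights).

    Each word moves all of its mass to the closest word on the other side.
    Taking the max over both directions gives the tighter of the two bounds.
    """
    def one_way(xs, ys):
        return sum(
            min(cosine_distance(emb[x], emb[y]) for y in ys) for x in xs
        ) / len(xs)

    return max(one_way(src_words, tgt_words), one_way(tgt_words, src_words))


def align_one_to_one(src_sents, tgt_sents):
    """Greedy 1-to-1 alignment: pick the closest target for each source sentence."""
    return [
        min(range(len(tgt_sents)), key=lambda j: relaxed_emd(s, tgt_sents[j]))
        for s in src_sents
    ]


if __name__ == "__main__":
    src = [["heart", "attack"], ["fever"]]
    tgt = [["fievre"], ["crise", "coeur"]]
    print(align_one_to_one(src, tgt))
```

An exact EMD would require solving a transport problem (for example with a linear-programming solver), and many-to-many alignment would need a search over sentence groupings rather than this per-sentence greedy choice.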
Related papers
- Exploiting Language Relatedness in Machine Translation Through Domain
Adaptation Techniques [3.257358540764261]
We present a novel approach of using a scaled similarity score of sentences, especially for related languages based on a 5-gram KenLM language model.
Our approach improves BLEU by 2 points with the multi-domain approach, 3 points with fine-tuning for NMT, and 2 points with iterative back-translation.
arXiv Detail & Related papers (2023-03-03T09:07:30Z)
- Domain Mismatch Doesn't Always Prevent Cross-Lingual Transfer Learning [51.232774288403114]
Cross-lingual transfer learning has been surprisingly effective in zero-shot cross-lingual classification.
We show that a simple regimen can overcome much of the effect of domain mismatch in cross-lingual transfer.
arXiv Detail & Related papers (2022-11-30T01:24:33Z)
- DICTDIS: Dictionary Constrained Disambiguation for Improved NMT [50.888881348723295]
We present DictDis, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries.
We demonstrate the utility of DictDis via extensive experiments on English-Hindi and English-German sentences in a variety of domains, including regulatory, finance, and engineering.
arXiv Detail & Related papers (2022-10-13T13:04:16Z)
- Non-Parametric Domain Adaptation for End-to-End Speech Translation [72.37869362559212]
End-to-End Speech Translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters.
We propose a novel non-parametric method that leverages domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system.
arXiv Detail & Related papers (2022-05-23T11:41:02Z)
- Non-Parametric Unsupervised Domain Adaptation for Neural Machine
Translation [61.27321597981737]
$k$NN-MT has shown the promising capability of directly incorporating the pre-trained neural machine translation (NMT) model with domain-specific token-level $k$-nearest-neighbor retrieval.
We propose a novel framework that directly uses in-domain monolingual sentences in the target language to construct an effective datastore for $k$-nearest-neighbor retrieval.
arXiv Detail & Related papers (2021-09-14T11:50:01Z)
- Phrase-level Active Learning for Neural Machine Translation [107.28450614074002]
We propose an active learning setting where we can spend a given budget on translating in-domain data.
We select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators.
In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods.
arXiv Detail & Related papers (2021-06-21T19:20:42Z)
- Improving Lexically Constrained Neural Machine Translation with
Source-Conditioned Masked Span Prediction [6.46964825569749]
In this paper, we tackle a more challenging setup consisting of domain-specific corpora with much longer n-grams and highly specialized terms.
To encourage span-level representations in generation, we additionally impose a source-sentence conditioned masked span prediction loss in the decoder.
Experimental results on three domain-specific corpora in two language pairs demonstrate that the proposed training scheme can improve the performance of existing lexically constrained methods.
arXiv Detail & Related papers (2021-05-12T08:11:33Z)
- Domain Adaptation and Multi-Domain Adaptation for Neural Machine
Translation: A Survey [9.645196221785694]
We focus on robust approaches to domain adaptation for Neural Machine Translation (NMT) models.
In particular, we look at the case where a system may need to translate sentences from multiple domains.
We highlight the benefits of domain adaptation and multi-domain adaptation techniques to other lines of NMT research.
arXiv Detail & Related papers (2021-04-14T16:21:37Z)
- A Simple Baseline to Semi-Supervised Domain Adaptation for Machine
Translation [73.3550140511458]
State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data.
We propose a simple but effective approach to the semi-supervised domain adaptation scenario of NMT.
This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation.
arXiv Detail & Related papers (2020-01-22T16:42:06Z)
- Urdu-English Machine Transliteration using Neural Networks [0.0]
We present a transliteration technique based on Expectation Maximization (EM) that is unsupervised and language-independent.
The system learns patterns and out-of-vocabulary words from a parallel corpus, so there is no need to train it explicitly on a transliteration corpus.
arXiv Detail & Related papers (2020-01-12T17:30:42Z)