AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations
- URL: http://arxiv.org/abs/2011.03783v1
- Date: Sat, 7 Nov 2020 14:28:54 GMT
- Title: AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations
- Authors: Lifeng Han, Gareth Jones, Alan Smeaton
- Abstract summary: We present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). The languages covered include English, Chinese, Polish, and German. We present a categorisation of the error types encountered by MT systems in performing MWE-related translation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). The MWEs covered are the verbal MWEs (vMWEs) defined in the PARSEME shared task, i.e., expressions whose head is a verb. The annotated vMWEs are also manually aligned bilingually and multilingually. The languages covered include English, Chinese, Polish, and German. Our source English corpus is taken from the 2018 edition of the PARSEME shared task. We machine-translated this source corpus and then applied human post-editing and annotation of target MWEs. Strict quality control was applied to limit errors: each MT output sentence first received manual post-editing and annotation, followed by a second round of manual quality checking. One of our findings during corpus preparation is that accurate translation of MWEs presents a challenge to MT systems. To facilitate further MT research, we present a categorisation of the error types MT systems make in MWE-related translation. To acquire a broader view of these issues, we compared four popular state-of-the-art MT models, namely Microsoft Bing Translator, GoogleMT, Baidu Fanyi, and DeepL MT. Thanks to the noise removal, translation post-editing, and MWE annotation performed by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research such as MT and information extraction. Our multilingual corpora are available as open access at github.com/poethan/AlphaMWE.
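Since the corpus layout is easiest to grasp from a concrete example, here is a minimal sketch of reading one bilingually aligned vMWE pair. The inline <mwe> tag format, the helper function, and the toy sentences are illustrative assumptions rather than the repository's documented layout; consult github.com/poethan/AlphaMWE for the actual file format.

```python
import re

# Hypothetical inline format: vMWE tokens wrapped in <mwe id=N>...</mwe> tags.
# The tag format and sample sentences are assumptions for illustration only;
# see github.com/poethan/AlphaMWE for the real release format.
MWE_TAG = re.compile(r"<mwe id=(\d+)>(.*?)</mwe>")

def extract_vmwes(sentence: str):
    """Return (plain_text, {mwe_id: surface_form}) for one annotated sentence."""
    spans = {int(i): text for i, text in MWE_TAG.findall(sentence)}
    plain = MWE_TAG.sub(lambda m: m.group(2), sentence)
    return plain, spans

# A toy English-German pair with one cross-lingually aligned vMWE (id 1).
en = "He decided to <mwe id=1>give up</mwe> smoking."
de = "Er beschloss, das Rauchen <mwe id=1>aufzugeben</mwe>."

en_plain, en_mwes = extract_vmwes(en)
de_plain, de_mwes = extract_vmwes(de)

# vMWEs sharing an id are translations of each other.
for mwe_id, en_form in en_mwes.items():
    print(f"vMWE {mwe_id}: '{en_form}' <-> '{de_mwes.get(mwe_id)}'")
```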
Related papers
- On Translating Technical Terminology: A Translation Workflow for Machine-Translated Acronyms
We find that an important step is being missed: the translation of technical terms, specifically acronyms.
Publicly available state-of-the-art machine translation systems such as Google Translate can be erroneous when dealing with acronyms.
We propose an additional step in the SL-TL (FR-EN) translation workflow: we first offer a new acronym corpus for public consumption and then experiment with a search-based thresholding algorithm.
arXiv Detail & Related papers (2024-09-26T15:18:34Z)
- Towards Zero-Shot Multimodal Machine Translation
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, adapts a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- Revisiting Machine Translation for Cross-lingual Classification
Most research in the area focuses on multilingual models rather than on the machine translation component.
We show that, by using a stronger MT system and mitigating the mismatch between training on original text and running inference on machine-translated text, translate-test can do substantially better than previously assumed (a minimal sketch of the setup follows this entry).
arXiv Detail & Related papers (2023-05-23T16:56:10Z)
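For context, translate-test keeps the classifier monolingual (typically English) and machine-translates each test input into that language at inference time. Below is a minimal sketch of the setup using public Hugging Face checkpoints picked purely for illustration; they are not the systems or tasks evaluated in the paper.

```python
# Translate-test in a nutshell: translate non-English inputs into English,
# then classify them with an English-only model. Checkpoints are illustrative.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

def translate_test(texts_de):
    """Classify German inputs with an English-only classifier via MT."""
    texts_en = [out["translation_text"] for out in translator(texts_de)]
    return classifier(texts_en)

print(translate_test(["Der Film war überraschend gut."]))
```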
- Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al.
We investigate the similarities and differences between the discourse structures of source and target languages.
We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
- ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback
ParroT is a framework to enhance and regulate the translation abilities of LLMs during chat.
Specifically, ParroT reformulates translation data into an instruction-following style.
We propose three instruction types for finetuning ParroT models: translation instruction, contrastive instruction, and error-guided instruction (sketched after this entry).
arXiv Detail & Related papers (2023-04-05T13:12:00Z)
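To make the instruction-following reformulation concrete, here is a hedged sketch of how one translation pair might be cast into the three instruction types named above. The prompt wording and field names are assumptions for illustration, not ParroT's released templates.

```python
# One translation pair reformulated into three ParroT-style instruction
# examples. Prompts and fields are illustrative assumptions; see the ParroT
# paper and repository for the actual templates.
pair = {
    "src": "Er hat den Nagel auf den Kopf getroffen.",
    "ref": "He hit the nail on the head.",
    "bad": "He hit the nail on his head.",  # an erroneous candidate
    "note": "the possessive breaks the idiomatic reading",
}

examples = [
    {   # 1) translation instruction: plain source-to-target translation
        "instruction": "Translate the following German sentence into English.",
        "input": pair["src"],
        "output": pair["ref"],
    },
    {   # 2) contrastive instruction: prefer the better of two candidates
        "instruction": "Which English translation of the German sentence is better?",
        "input": f'{pair["src"]}\nA: {pair["ref"]}\nB: {pair["bad"]}',
        "output": "A",
    },
    {   # 3) error-guided instruction: point out what is wrong in a candidate
        "instruction": "Identify the error in this English translation.",
        "input": f'{pair["src"]}\n{pair["bad"]}',
        "output": pair["note"],
    },
]

for ex in examples:
    print(ex["instruction"], "->", ex["output"])
```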
- Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation
We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
arXiv Detail & Related papers (2022-12-20T10:18:18Z)
- DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages
DivEMT is the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages.
We assess the impact on translation productivity of two state-of-the-art NMT systems, namely Google Translate and the open-source multilingual model mBART50.
arXiv Detail & Related papers (2022-05-24T17:22:52Z)
- SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task
We participated in four translation directions of three language pairs: English-Chinese, English-Polish, and German-Upper Sorbian.
Based on different conditions of language pairs, we have experimented with diverse neural machine translation (NMT) techniques.
In our submissions, the primary systems took first place in the English to Chinese, Polish to English, and German to Upper Sorbian translation directions.
arXiv Detail & Related papers (2020-10-11T00:40:05Z)
- What's the Difference Between Professional Human and Machine Translation? A Blind Multi-language Study on Domain-specific MT
Machine translation (MT) has been shown to produce a number of errors that require human post-editing, but the extent to which professional human translation (HT) contains such errors has not yet been compared.
We compile pre-translated documents in which MT and HT are interleaved, and ask professional translators to flag errors and post-edit these documents in a blind evaluation.
We find that the post-editing effort for MT segments is only higher in two out of three language pairs, and that the number of segments with wrong terminology, omissions, and typographical problems is similar in HT.
arXiv Detail & Related papers (2020-06-08T17:55:14Z)
- MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora
Multi-word expressions (MWEs) are a hot topic in natural language processing (NLP) research.
The availability of bilingual or multi-lingual MWE corpora is very limited.
We present a collection of 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively after filtering.
arXiv Detail & Related papers (2020-05-21T11:46:44Z)